wikipedia in the tourism industry: forecasting demand and...

6
Wikipedia in the Tourism Industry: Forecasting Demand and Modeling Usage Behavior Pejman Khadivi Discovery Analytics Center Computer Science Department Virginia Tech, Blacksburg, Virginia Email: [email protected] Naren Ramakrishnan Discovery Analytics Center Computer Science Department Virginia Tech, Arlington, Virginia Email: [email protected] Abstract Due to the economic and social impacts of tourism, both private and public sectors are interested in precisely forecasting the tourism demand volume in a timely manner. With recent advances in social networks, more people use online resources to plan their future trips. In this paper we explore the application of Wikipedia usage trends (WUTs) in tourism analysis. We propose a framework that deploys WUTs for forecasting the tourism demand of Hawaii. We also propose a data- driven approach, using WUTs, to estimate the behavior of tourists when they plan their trips. 1 Introduction Tourism is considered as one of the most profitable indus- tries around the world and, hence, tourism demand forecast- ing is important in various domains such as business plan- ning and assessing economic activity in a region. Various prediction techniques have been used in the literature for de- mand forecasting (Song and Li 2008). Recently, there have been different studies that suggest that the use of search en- gine datasets, such as Google Search Trend data can help improve forecasting accuracy (Bangwayo-Skeete and Skeete 2015; Yang and others 2015). In recent times, online browsing activity has been identi- fied as a precursor to planning travel and visits to far flung destinations. Analyzing such online activity can provide de- tailed information about places of interest, volume of inter- est, and spikes/troughs in attention. Our hypothesis in this paper is that, by looking at Wikipedia usage trends (WUT), we may be able to improve the accuracy of tourism demand forecasts. Wikipedia usage trends have been previously used for forecasting applications in (Hickmann and others 2015) to forecast the Influenza season. Pertaining to the tourism in- dustry, Wikipedia has been used previously for sightseeing recommendation and narrative generation purposes (Fang and others 2015; Hecht and others 2007). However, to the best of our knowledge, there has been no research on using WUTs for tourism demand forecasting. Beside demand fore- casting, another important tourism-related issue is to extract Copyright c 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. the behavior of people when they make decisions about their trips, e.g., how early he/she reads about the National Mall or plans to book his/her flight. The answers to such questions are useful for resource allocation and marketing purposes. In this paper, we aim to use Wikipedia usage trends for the purpose of tourism analysis. The main contributions of this paper are twofold. First, we propose forecasting algorithms to deploy WUTs and Google Search Trends for tourism de- mand forecasting. Second, we propose a framework for the extraction of tourists’ behavior in trip planning. We per- formed various experiments on a Hawaii tourism dataset. Results show that WUTs can improve the accuracy of de- mand forecasting. Furthermore, the extracted tourist behav- iors are consistent with the results of the surveys conducted by tourism departments. Thus, our contributions are: Developing a regression approach that deploys WUTs to improve the accuracy of tourism demand forecasting. Proposing a novel framework for tourist behavior extrac- tion from WUTs and tourism demand time series. 2 Related Work In order to forecast tourism demand, different univariate and multivariate regression methods have been deployed in the literature. Song and Witt (2006) used vector autoregressive (VAR) model to forecast tourist arrivals to Macau. In addi- tion to a tourist arrival time series, they also incorporated the costs of living in Macau and the originating country. Gunter and Onder (2015) compared the performance of vari- ous forecasting models, such as VAR and ARMA, to predict international tourism demand in Paris. Athanasopoulos and de Silva (2012) deployed multivariate exponential smooth- ing methods for the forecasting of tourism demand for Aus- tralia and New Zealand. Chan and others (2005) used var- ious GARCH models to address variations in tourism de- mand caused by economical instabilities. The use of gravity models for tourism demand analysis has been addressed in (Morley and others 2014). Song and Li (2008) provide a de- tailed survey on various forecasting techniques. A number of tourism demand forecasting approaches are based on usage trend data that are supplied by search en- gines. Xiang and Pan (2011) performed behavior analysis to understand how tourists use search engines for travel plan- ning. They determined which keywords are more important when tourists use search engines. Choi and Varian (2012)

Upload: others

Post on 12-Feb-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Wikipedia in the Tourism Industry: Forecasting Demand and ...people.cs.vt.edu/naren/papers/Wikipedia_tourism_iaai2016.pdf · Proposing a novel framework for tourist behavior extrac-tion

Wikipedia in the Tourism Industry:Forecasting Demand and Modeling Usage Behavior

Pejman KhadiviDiscovery Analytics Center

Computer Science DepartmentVirginia Tech, Blacksburg, Virginia

Email: [email protected]

Naren RamakrishnanDiscovery Analytics Center

Computer Science DepartmentVirginia Tech, Arlington, Virginia

Email: [email protected]

Abstract

Due to the economic and social impacts of tourism, bothprivate and public sectors are interested in preciselyforecasting the tourism demand volume in a timelymanner. With recent advances in social networks, morepeople use online resources to plan their future trips.In this paper we explore the application of Wikipediausage trends (WUTs) in tourism analysis. We proposea framework that deploys WUTs for forecasting thetourism demand of Hawaii. We also propose a data-driven approach, using WUTs, to estimate the behaviorof tourists when they plan their trips.

1 IntroductionTourism is considered as one of the most profitable indus-tries around the world and, hence, tourism demand forecast-ing is important in various domains such as business plan-ning and assessing economic activity in a region. Variousprediction techniques have been used in the literature for de-mand forecasting (Song and Li 2008). Recently, there havebeen different studies that suggest that the use of search en-gine datasets, such as Google Search Trend data can helpimprove forecasting accuracy (Bangwayo-Skeete and Skeete2015; Yang and others 2015).

In recent times, online browsing activity has been identi-fied as a precursor to planning travel and visits to far flungdestinations. Analyzing such online activity can provide de-tailed information about places of interest, volume of inter-est, and spikes/troughs in attention. Our hypothesis in thispaper is that, by looking at Wikipedia usage trends (WUT),we may be able to improve the accuracy of tourism demandforecasts.

Wikipedia usage trends have been previously used forforecasting applications in (Hickmann and others 2015) toforecast the Influenza season. Pertaining to the tourism in-dustry, Wikipedia has been used previously for sightseeingrecommendation and narrative generation purposes (Fangand others 2015; Hecht and others 2007). However, to thebest of our knowledge, there has been no research on usingWUTs for tourism demand forecasting. Beside demand fore-casting, another important tourism-related issue is to extract

Copyright c© 2016, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

the behavior of people when they make decisions about theirtrips, e.g., how early he/she reads about the National Mall orplans to book his/her flight. The answers to such questionsare useful for resource allocation and marketing purposes.

In this paper, we aim to use Wikipedia usage trends for thepurpose of tourism analysis. The main contributions of thispaper are twofold. First, we propose forecasting algorithmsto deploy WUTs and Google Search Trends for tourism de-mand forecasting. Second, we propose a framework for theextraction of tourists’ behavior in trip planning. We per-formed various experiments on a Hawaii tourism dataset.Results show that WUTs can improve the accuracy of de-mand forecasting. Furthermore, the extracted tourist behav-iors are consistent with the results of the surveys conductedby tourism departments. Thus, our contributions are:• Developing a regression approach that deploys WUTs to

improve the accuracy of tourism demand forecasting.• Proposing a novel framework for tourist behavior extrac-

tion from WUTs and tourism demand time series.

2 Related WorkIn order to forecast tourism demand, different univariate andmultivariate regression methods have been deployed in theliterature. Song and Witt (2006) used vector autoregressive(VAR) model to forecast tourist arrivals to Macau. In addi-tion to a tourist arrival time series, they also incorporatedthe costs of living in Macau and the originating country.Gunter and Onder (2015) compared the performance of vari-ous forecasting models, such as VAR and ARMA, to predictinternational tourism demand in Paris. Athanasopoulos andde Silva (2012) deployed multivariate exponential smooth-ing methods for the forecasting of tourism demand for Aus-tralia and New Zealand. Chan and others (2005) used var-ious GARCH models to address variations in tourism de-mand caused by economical instabilities. The use of gravitymodels for tourism demand analysis has been addressed in(Morley and others 2014). Song and Li (2008) provide a de-tailed survey on various forecasting techniques.

A number of tourism demand forecasting approaches arebased on usage trend data that are supplied by search en-gines. Xiang and Pan (2011) performed behavior analysis tounderstand how tourists use search engines for travel plan-ning. They determined which keywords are more importantwhen tourists use search engines. Choi and Varian (2012)

Page 2: Wikipedia in the Tourism Industry: Forecasting Demand and ...people.cs.vt.edu/naren/papers/Wikipedia_tourism_iaai2016.pdf · Proposing a novel framework for tourist behavior extrac-tion

used Google search trends (GST) to forecast the number ofmonthly visitors of Hong Kong. They use data from the firsttwo weeks of a month to forecast the total number of visitorsin the whole month. In (Bangwayo-Skeete and Skeete 2015),GST data about hotels and flights search terms for populartourist destinations in the Caribbean are used for forecastingtourism demand. Yang and others (2015) use GST and Baidutrends to predict tourism demand in the Hainan Province ofChina. In (Artola and others 2015), GST is used to improvethe forecasting of tourism demand for Spain.

3 Problem FormulationIn this paper, we use three sets of time series: tourism de-mand, WUTs, and GSTs. We assume that we have a setof P Wiki-pages related to a specific tourism destination,W = {W1, · · · ,WP }. The WUT of a specific page, Wi, isa time series denoted byWi = wi1, · · · , wit, where wit is thetotal number of times that Wi has been used at time t. Wealso assume that we have a set of Q Google search terms,G = {G1, · · · , GQ} and that the GST data correspondingto Gj is represented by Gj = gj1, · · · , g

jt . The tourism de-

mand time series is illustrated by Y = y1, · · · , yt whereyt is the total number of monthly visitors at time t. In asimilar way, YD and YI are the time series that representthe total number of domestic and international visitors, re-spectively. For a specific time series X , we denote the se-quence of samples between time points 1 and t0 by Xt0 , i.e.Xt0 = [x1 · · ·xt0 ]T . In the following subsections we intro-duce two problems that are studied in the rest of the paper.

3.1 The Demand Forecasting ProblemThe demand forecasting problem is to forecast the volumeof tourism demand (i.e. tourist arrival) in the future. In themost general setting, forecasting the tourism demand at timet based on time series D which is available uptil time t− k,i.e. Dt−k = d1, · · · , dt−k, can be formulated as follows:

yt = F (Dt−k) (1)

where F is the forecasting function and D can be any uni-variate or multivariate time series. As an example, if we aimto forecast tourism demand based on the observed values ofY we will have yt = F (Yt−k).

The problem of tourism demand forecasting using WUTof a Wiki-page, Wi, is formulated in the following equation:

yt = F(Yt−k,Wi

t−k)

(2)

where F(·) is the estimation function. In Eq. 2, the forecast-ing lead time is k months, i.e. we aim to forecast the demandk months earlier. The forecasting problem using GST andother data sources can be formulated in a similar way.

3.2 The Behavior Extraction ProblemLet us assume that a typical trip planning has n activities,A = {A1, · · · ,An}. As an example, these activities may in-clude flight reservation, hotel booking, and planning to visitdifferent attractions. The problem of tourists behavior ex-traction is to determine how long before the actual trip time,

a typical tourist performs each of these activities and is re-ported as a distribution over time. We use aAi

j to show thefraction of tourists that perform activity Ai in j months ear-lier than the actual trip time. Then, we will have:

∞∑j=0

aAij = 1 , i = 1, · · · , n. (3)

In reality, a finite upper bound can be considered for j. Thetraditional approach to acquire the aAi

j values is throughconducting surveys (HTA Report 2002-2013 ). However,since people use online resources for trip planning, we ex-pect that usage trends of online resources can be used toextract average tourist behavior during trip planning. Wiki-pages, based on their contents, can be classified into differ-ent categories. As an example, a Wiki-page that is about anational park can be classified as an attraction. This classifi-cation can be conducted based on the trip planning activitiesin A. Then, the problem of tourists behavior extraction usingWUTs is to determine how tourists visit the pages in eachcategory prior to their trip and is formulated as follows:

aAij = B

(YD,YI ,WAi

)(4)

where, B is the estimation function and WAi is a set ofWUT time series of pages that belong to the category of Ai.

4 Dataset DescriptionIn this paper, we mainly use Wikipedia pages and their us-age trends for tourism in Hawaii. Monthly tourism demandtime series for Hawaii are available through the website ofthe Department of Business, Economic Development andTourism of the state of Hawaii1. We collected Wikipediapages and their usage trends from English Wikipedia andWikipedia Trends2 websites. For this purpose, we conducteda breadth-first-search approach on the graph of Wikipediapages, beginning from the page for Hawaii 3. We retrievedall the Wiki-pages related to Hawaii by extracting pages thatcontain a minimum number of mentions of Hawaii. Usingthis approach we retrieved 3,126 pages. Also, we collectedGST time series for Hawaii, Hawaii flight, and Hawaii hotelsearch terms. Considering the data availability, we focusedour study on the time frame between January 2008 and April2015.

Page Classification: For further analysis, Wiki-pages areclassified into the following tourism-related categories:• Attractions: pages that provide information about various

types of attraction such as beaches and museums.• Flights: pages that provide information about airlines and

airports in Hawaii.• Schools and universities: pages that provide information

about schools, colleges, and universities in Hawaii.• Events: pages that provide information about various

events in Hawaii such as festivals and sport events.

1dbedt.hawaii.gov/visitor/tourism/2www.wikipediatrends.com3en.wikipedia.org/wiki/Hawaii

Page 3: Wikipedia in the Tourism Industry: Forecasting Demand and ...people.cs.vt.edu/naren/papers/Wikipedia_tourism_iaai2016.pdf · Proposing a novel framework for tourist behavior extrac-tion

Figure 1: The classifier used for Wiki-page categorizing.

Table 1: Number of pages in each category.Category # Pages Category # PagesGeneral Information 2202 Attractions 518Flights 53 Events 50Transportation 81 Schools andHotels 20 universities 202

• Transportation: pages that provide information about dif-ferent types of transportation in Hawaii.

• Hotels: pages that provide information about hotels andaccommodations in Hawaii.

• General information: other pages.

Using a dictionary of keywords for each category, Wiki-pages are mapped to the feature space using a bag-of-wordsapproach. Furthermore, training, test, and validation datasetsare constructed by random selection of 85, 60, and 150pages, respectively. These pages are labeled manually basedon their content. Preliminary experiments with various clas-sifiers showed that for each of the pages in the validationdataset, at least one of the kNN and Decision Tree (DT)classifiers return the correct label of the page. Hence, we de-ployed a hierarchical classification model, illustrated in Fig.1, to classify the rest of the pages. In the deployed ensem-ble classifier, we first train kNN and DT classifiers. Then,we classify each Wiki-page using the kNN and DT to haveCK and CT classes, respectively. If CT = CK , we acceptCK as the final class of the page. On the other hand, whenCT 6= CK , we train an SVM classifier to classify the pagefor two classes. The result of the SVM classifier is then ac-cepted as the category of the page. We tested our method onthe test dataset and the classification error was 8%. Table 1shows the number of pages in each category. Tourism de-mand time series are illustrated in Fig. 2(a). We can observethat the monthly count of the domestic tourists is alwayshigher than the monthly count of the international ones. Forcomparison purposes, the average usage trends of the abovecategories are illustrated in Fig. 2(b).

5 Forecasting using Wikipedia Usage TrendsIn this section, we aim to explore how WUT influencesthe accuracy of tourism demand forecasts. A well knownapproach for tourism demand forecasting is to use autore-gressive models, AR(m). Here, we deploy this approach asthe building block of other regression methods. The generalmodel of AR(m) is as follows:

yARt = β +

m−1∑j=0

αjyt−k−j (5)

where, k is the forecasting lead time, αj’s and β are regres-sion coefficients, and m is the order of the model. In the

(a) (b)

Figure 2: (a) Tourism demand time series (b) Average usagetrend of page categories.

following subsections we extend this model to incorporateWUT and GST time series in the forecasting process.

5.1 Regression with Wikipedia Usage TrendsThe general model of tourism demand forecasting usingWUT is defined in Eq. 2. Using a linear regression setting forthe forecasting function F(·), the forecasting model basedon the usage trend of Wiki-page Wi is as follows:

yWit = β +

m−1∑j=0

αjyt−k−j +

m−1∑j=0

γjwit−k−j (6)

where β, αj’s, and γj’s are regression coefficients. In therest of the paper, we denote this model by ARW(m). Notethat ARW(m) is an autoregressive exogenous (ARX) modelwhere the external variable is a WUT time series.

Experiments show that each of the Wiki-pages results in aslightly different estimation and the accuracy of estimationbased on each individual Wiki-page changes over time. Toovercome this variation and in order to improve the accuracyof the forecasts, we use an ensemble method to aggregate theestimations that have been achieved based on different Wiki-pages. Let us assume that we have P Wikipedia pages andwe aim to forecast the tourism demand at time t, yt, based onthe data that is available uptil time t−k. For this purpose, wename the latest v months of the available data as the valida-tion set, V = {yt−k−v+1, · · · , yt−k}. Then, for each Wiki-page, the average performance of ARW(m) is measured onthe validation set, V. The topN pages that result in the high-est accuracy are then selected to perform the final forecast-ing. We represent this set of top-N Wiki-pages as WTop

N .Then, the final forecast is performed as follows:

yWt =1

N

∑Wi∈WTop

N

yWit (7)

where yWit is determined based on Eq. 6. This algorithm,

which we name it as Wiki-based Regression(m,N ), is illus-trated in Algorithm 1.

5.2 Regression with Google Search TrendsLet us assume that we have Q time series of GST and weaim to forecast the tourism demand at time t, yt, based onthe data available uptil time t− k. Then, we will have:

yGit = β +

m−1∑j=0

αjyt−k−j +

m−1∑j=0

γjgit−k−j (8)

Page 4: Wikipedia in the Tourism Industry: Forecasting Demand and ...people.cs.vt.edu/naren/papers/Wikipedia_tourism_iaai2016.pdf · Proposing a novel framework for tourist behavior extrac-tion

Algorithm 1 Forecasting with WUT AlgorithmFunction ARW performs forecasting based on Eq. 61: function WIKI-BASED REGRESSION(W,Y ,N ,t,m,k)2: for p = 1 to P do3: ESum = 04: for i = 1 to v do5: y = ARW(Wp

t−i−k,Yt−i−k,m,k)6: ESum = ESum + (yt−i − y)2

7: RMSEp =√

ESum/v

8: ySum = 09: for i = 1 to N do

10: p = Wiki-page index with ith lowest RMSE11: y = ARW(Wp

t−k,Yt−k,m,k)12: ySum = ySum + y

13: return 1NySum

where β, αj’s, and γj’s are regression coefficients. In the restof the paper, we call this algorithm as ARG(m). Then, thefinal estimation of tourism demand using GST time series,yGt , is calculated using the following equation:

yGt =1

Q

Q∑i=1

yGit (9)

where yGit is calculated using ARG(m) and the overall esti-

mation method is called Google-based Regression(m).Similar to Wiki-based and Google-based regressions, we

can combine the forecasting results of individual regressionsbased on WUTs and GSTs in the following manner:

yWGt =

1

N +Q

∑Wi∈WTop

N

yWit +

Q∑i=1

yGit

(10)

where yWit and yGi

t are the estimations of yt based on Eq. 6and Eq. 8, respectively. We call the overall ensemble methodas Google-Wiki Ensemble.

5.3 Experimental ResultsTo study the performance of the forecasting techniques weperformed experiments on three tourist arrival time seriesof Hawaii: YD, YI , and Y . For each time series, the last12 months are considered as the test dataset. The value ofN in Algorithm 1 was set to 10 for Y and to 15 for YDand YI . The size of the validation set in Algorithm 1, v,was 12 months. We used root-mean-square error (RMSE) ofestimations as a measure of accuracy. Results are illustratedin Figures 3 and 4.

The average performance of various forecasting methodswith different parameter settings are compared in Fig. 3.Here, k and m are changing between 1 and 9. In Fig. 3 (a),(c), and (e), for each value of k, we changed the value ofm from 1 to 9 and reported the average RMSE of the fore-casts. Similarly, in Fig. 3 (b), (d), and (f), for each value ofm, we changed the value of k from 1 to 9 and reported theaverage RMSE of the forecasts. It is clear from Fig. 3 thatalmost everywhere, the accuracy of forecasting based Wiki-based Regression and Google-Wiki Ensemble is higher than

(a) (b)

(c) (d)

(e) (f)Figure 3: The average RMSE of different regression algo-rithms w.r.t. m and lead-time: (a) and (b) All tourists, (c)and (d) domestic tourists, (e) and (f) international tourists.

the accuracy of AR(m) and Google-based Regression. TheRMSE of the four forecasting methods for the best valuesof m are compared in Fig. 4. This figure compares the per-formance of the forecasting methods for four different leadtimes. Results show that in most of the cases either Wiki-based Regression or Google-Wiki Ensemble results in thelowest RMS errors.

The importance of each category of Wiki-pages for do-mestic and international tourists is illustrated in Fig. 5. Inthis experiment, similar to Algorithm 1, we used the methodof ARW(m) to determine the top-250 Wiki-pages, WTop

250 .Then, we measured the ratio of pages that belong to eachcategory in WTop

250 . This figure depicts that domestic and in-ternational tourists have different behaviors. As an example,while for domestic tourists, the Hotel category is highlightedwith lead time of 1, for international tourists it is highlightedwith lead time 6. We can also observe from this figure thatWiki-pages in Flight, Hotel, Events, and Transportation cat-egories are more important than the other three ones.

6 Wikipedia Usage BehaviorOne of the important issues in tourism industry is to knowwhen people start to plan their trips and how they use onlineresources for that purpose. In this section, we propose a data-driven framework based on WUTs to answer this question.

We categorize tourists into two groups: Wikipedia read-ers and non-Wikipedia readers. A Wikipedia reader touristis a person that uses Wikipedia as an online resource before

Page 5: Wikipedia in the Tourism Industry: Forecasting Demand and ...people.cs.vt.edu/naren/papers/Wikipedia_tourism_iaai2016.pdf · Proposing a novel framework for tourist behavior extrac-tion

(a) (b)

(c) (d)Figure 4: Forecasting RMS errors for different regressionalgorithms with (a) one month lead-time, (b) 3 months lead-time, (c) 6 months lead-time, and (d) 9 months lead-time.

Figure 5: Word clouds illustrating the importance of eachclass of Wikipedia pages in forecasting the tourism demand.

and during his trip. We assume that every Wikipedia readertourist visits a Wiki-page only once and that the fraction ofall the tourists that are Wikipedia reader, KWi

, is constantfor each specific page. Hence, if at time twe have yt tourists,yWit of them are Wikipedia readers that read Wiki-page Wi

and we have: KWi = yWit /yt. To simplify the analysis, we

assume that all Wikipedia reader tourists start to read Wiki-pages at most M months before their actual trip time. Re-sults in section 5.3 indicate that domestic and internationaltourists may have different planning behaviors. Therefore,in this framework, we disaggregate the behavior of domesticand international tourists and based on the above assump-tions, we have:

wit =

M∑j=0

KDWibijy

Dt+j +

M∑j=0

KIWicijy

It+j + νt + et (11)

where, wit is the number of visits to the Wiki-page Wi,and yDt+j and yIt+j are the number of domestic and interna-tional tourists at time t+ j, respectively. Furthermore, KD

Wi

and KIWi

represent the ratio of domestic and internationalWikipedia reader tourists, respectively. Also, bij’s and cij’sare constant coefficients that determine ratio of Wikipedia

reader tourists that have read Wi, j months before the triptime. Note that since coefficients bij and cij represent thereading distributions, these coefficients cannot be negativeand each set should be summed up to 1. In Eq. 11, νt and etare two error components, illustrating the uncertainties thatwe have in the model. Uncertainty (or variation) in the as-sumptions and data (i.e. noise in KWi

, wit, etc.) is modeledby et and Wiki-page visits that have a non-tourist reason areshown by νt. The intuition behind this model is that the num-ber of Wikipedia page-visits for a specific area is a linearfunction of the number of tourists that will visit that specificarea (i.e. in this paper Hawaii) in the future, plus some noise.

Yearly trip-planning surveys conducted for Hawaii (HTAReport 2002-2013 ) show that trip planning activities fol-low a normal-like distribution. Since we assumed thatWikipedia-reader tourists read Wiki-pages during their tripplanning, we expect that the reading distributions follow anormal-like distribution as well. Therefore, in order to applythis prior knowledge to the model and to dictate the estima-tion model to follow a normal-like distribution we need toadd the following set of constraints:

b0 ≤ b1 ≤ · · · ≤ bβ−1 ≤ bβ ≥ bβ+1 ≥ · · · ≥ bM (12)c0 ≤ c1 ≤ · · · ≤ cζ−1 ≤ cζ ≥ cζ+1 ≥ · · · ≥ cM

where β and ζ are indexes that the maximum value of distri-butions occur for b and c, respectively.

In order to estimate the Wikipedia reading distributionsof domestic and international tourists, we first drop the errorfactors, νt and et, from Eq. 11 to calculate the estimatednumber of visits to the Wiki-pageWi, i.e. wit. Then, we solvethe following optimization problem:

bij∗, cij∗,KD

Wi

∗,KI

Wi

∗= arg min

1

T

T∑t=1

(wit − wit

)2(13)

s.t. 0 < KI ≤ 1 , 0 < KD ≤ 1

0 ≤ bij ≤ 1 , 0 ≤ cij ≤ 1 , j = 0, 1, · · · ,MM∑j=0

bij =

M∑j=0

cij = 1

b0 ≤ b1 ≤ · · · ≤ bβ−1 ≤ bβ ≥ bβ+1 ≥ · · · ≥ bMc0 ≤ c1 ≤ · · · ≤ cζ−1 ≤ cζ ≥ cζ+1 ≥ · · · ≥ cM

In the optimization problem of Eq. 13, M , β, and ζ aremodel parameters which are estimated through cross valida-tion.

Note that solving the problem of Eq. 13 results in tworeading distributions: bij’s for domestic tourists and cij’s forinternational ones. It is easy to show that by taking theweighted average of these two distributions, one can calcu-late the average reading distribution of all tourists. In otherwords, if N is the total number of tourists and ND is thenumber of domestic tourists we will have:

aij =NDN

bij +N −ND

Ncij , j = 0, 1, · · · ,M (14)

The average of reading distributions for category of pages(as defined in Section 4) are illustrated in Fig. 6. In order to

Page 6: Wikipedia in the Tourism Industry: Forecasting Demand and ...people.cs.vt.edu/naren/papers/Wikipedia_tourism_iaai2016.pdf · Proposing a novel framework for tourist behavior extrac-tion

(a) (b) (c)Figure 6: Reading distributions averaged over different page categories for (a) all, (b) domestic, and (c) international tourists.

calculate the average distribution of category A we use thefollowing equation:

aAj =

∑Wi∈AKWi

aij∑Wi∈AKWi

(15)

where KWiis the average ratio of Wikipedia reader tourists

for pageWi. Similar to Eq. 14,KWiis calculated usingKD

Wi

and KIWi

. Fig. 6(a) shows the average reading distributionsfor all the tourists. The average reading distributions of do-mestic and international tourists for category A, i.e. bAj andcAj , can also be calculated in a similar way and are illus-trated in Fig. 6 (b) and (c). Figure 6 also shows the averageratio of Wikipedia reader tourists for each category which iscalculated as follows:

K =

∑Wi∈AKWi

|A|(16)

where |A| represents the size of category A.Figure 6(a) depicts that on average, most of the Wikipedia

reading activities have occurred about 4 to 8 months prior tothe trip. This is inline with the surveys reported in (HTAReport 2002-2013 ) as the mean decision date for most ofthe activities are between 4 to 8 months before the actualarrival date. Figure 6(a) also indicates that the most popu-lar categories are Events, General information, and Flights.A comparison of Fig. 6(b) and (c) shows that internationaltourists start planning their trip about 4 to 6 months beforethe domestic ones. Furthermore, we can observe that on av-erage, the ratio of Wikipedia reader tourists is higher amonginternational tourists than the domestic ones.

7 ConclusionIn this paper we showed that Wikipedia usage trends canbe effectively used in tourism planning. Experimental re-sults indicated that WUT time series improve the accuracy oftourism demand forecasts. We also used WUT and tourismdemand time series to estimate the behavior of tourists in tripplanning. Results are consistent with the survey results gath-ered from domestic and international tourists of Hawaii. Asfuture work, we aim to use other sources (e.g. TripAdvisor)to rank the Wikipedia pages and consider the relationshipgraph of Wiki-pages to improve forecasting and behavior es-timation performances.

ReferencesArtola, C., et al. 2015. Can internet searches forecasttourism inflows? Int. Journal of Manpower 36(1):103–116.Athanasopoulos, G., and de Silva, A. 2012. Multivariate ex-ponential smoothing for forecasting tourist arrivals. Journalof Travel Research 51(5):640652.Bangwayo-Skeete, P., and Skeete, R. 2015. Can googledata improve the forecasting performance of tourist ar-rivals? mixed-data sampling approach. Tourism Manage-ment 46:454–464.Chan, F., et al. 2005. Modelling multivariate interna-tional tourism demand and volatility. Tourism Management26:459471.Choi, H., and Varian, H. 2012. Predicting the present withgoogle trends. The Economic Record 88:2–9.Fang, G., et al. 2015. How to extract seasonal features ofsightseeing spots from twitter and wikipedia. Bulletin ofNetworking, Computing, Systems, and Software 4(1):21–26.Gunter, U., and Onder, I. 2015. Forecasting internationalcity tourism demand for paris: Accuracy of uni- and multi-variate models employing monthly data. Tourism Manage-ment 46:123–135.Hecht, B., et al. 2007. Generating educational tourism narra-tives from wikipedia. In Intelligent Narrative Technologies,Papers from the 2007 AAAI Fall Symposium, 37–44.Hickmann, K., et al. 2015. Forecasting the 20132014 in-fluenza season using wikipedia. PLoS Comput Biol 11(5).HTA Report 2002-2013. Visitor satisfaction & ac-tivity. www.hawaiitourismauthority.org/research/reports/visitor-satisfaction/Last accessed Aug. 2015.Morley, C., et al. 2014. Gravity models for tourism demand:theory and use. Annals of Tourism Research 48:1–10.Song, H., and Li, G. 2008. Tourism demand modeling andforecasting-a review of recent research. Tourism Manage-ment 29:203–220.Song, H., and Witt, S. F. 2006. Forecasting internationaltourist flows to macau. Tourism Management 27:214224.Xiang, Z., and Pan, B. 2011. Travel queries on cities in theunited states: Implications for search engine marketing fortourist destinations. Tourism Management 32:8897.Yang, X., et al. 2015. Forecasting chinese tourist volumewith serach engine data. Tourism Management 46:386–397.