
Predicting Disease Outbreaks Using Social Media: Finding Trustworthy Users

Razieh Nokhbeh Zaeem, David Liau, and K. Suzanne Barber

UTCID Report #18-07

MAY 2018


Predicting Disease Outbreaks Using Social Media: Finding Trustworthy Users

Razieh Nokhbeh Zaeem, David Liau, and K. Suzanne Barber

Center for Identity, The University of Texas at Austin, [email protected], [email protected], [email protected]

Abstract. The use of Internet data sources, in particular social media, for biosurveillance has gained attention and credibility in recent years. Finding related and reliable posts on social media is key to performing successful biosurveillance utilizing social media data. While researchers have implemented various approaches to filter and rank social media posts, the fact that these posts are inherently related by the credibility of the poster (i.e., the social media user) remains overlooked. We propose six trust filters to filter and rank trustworthy social media users, as opposed to concentrating on isolated posts. We present a novel biosurveillance application that gathers social media data related to a bio-event, processes the data to find the most trustworthy users and hence their trustworthy posts, and feeds these posts to other biosurveillance applications, including our own. We further present preliminary experiments to evaluate the effectiveness of the proposed filters and discuss future improvements. Our work paves the way for collecting more reliable social media data to improve biosurveillance applications.

Keywords: Biosurveillance, Social Media, Twitter, Trust

1 Introduction

Thanks to the ever-growing use of social media, the Internet is now a rich source of opinions, narratives, and information, expressed by millions of users in the form of unstructured text. These users report, among many other things, their encounters with diseases and epidemics. Internet biosurveillance utilizes the data sources found on the Internet (such as news and social media) to improve detection, situational awareness, and forecasting of epidemiological events. In fact, since the mid-1990s, researchers have used Internet biosurveillance techniques to predict a wide range of events, from influenza [5] to earthquakes [9]. Internet biosurveillance takes advantage of what is called the hivemind on social media—the collective intelligence of Internet users.

The sources of Internet biosurveillance (e.g., social media) are, generally, timely, comprehensive, and available [10]. These sources, however, are enormous and noisy. An important pre-processing step to draw meaningful results from these sources is to filter and rank the most related parts of the data sources.


Such filtering and ranking is widely recognized in the literature. For instance, in their overview of Internet biosurveillance [10], Hartley et al. break the process of Internet biosurveillance into four steps: (1) the collection of data from the Internet; (2) the processing of the data into information; (3) the assembling of that information into analysis; and (4) the propagation of the analysis to biosurveillance experts. They identify relevancy ranking as one of the important sub-steps of processing data into information in step two, before the actual analysis begins in step three.

In order to filter and rank the posts (i.e., Twitter posts or news articles), researchers have implemented various approaches, like Machine Learning (e.g., Naive Bayes and Support Vector Machines [19, 6]) and Natural Language Processing (e.g., Keyword and Semantic-based Filtering [8] and Latent Dirichlet Allocation [7]). All the previous efforts, however, have focused on ranking the posts independently [12], ignoring the fact that these posts (Twitter posts or news articles) are inherently related by virtue of the credibility of the poster (the Twitter user or news agency).

Furthermore, users of social media can post about anything they wish to talk about. Some users talk about their illnesses online, and these are the users we wish to monitor, as they give us a sampling of the nation's infectious disease state. However, users can also talk about being ill to elicit sympathy from other users, or they can simply be faking it. It is therefore important to evaluate the trustworthiness of users before extracting data for analysis.

Unlike previous work, we observe that the credibility of the users with respect to a given epidemiological event should be taken into account when filtering and ranking related posts. We propose six trust filters that filter and rank social media users who post about epidemiological events: Expertise, Experience, Authority, Reputation, Identity, and Proximity. These trust filters estimate the credibility or trustworthiness of a user by considering the structure of the social network (e.g., the number of Twitter followers), the user's history of posts, the user's geo-location, and his/her most recent post.

While we focus on the relevancy ranking sub-step by measuring user trustworthiness, we introduce a comprehensive framework that performs the entire cycle of Internet biosurveillance as described by the four steps of Hartley et al. [10]. We leave the technical details of some of the steps out of this paper and discuss them separately elsewhere.

Finally, in a preliminary set of experiments, we collect the posts and geo-locations of 2,000 real Twitter users. We investigate the effectiveness of our proposed trust filters. We observe the statistics of the filter scores and the correlations between the filters and suggest future improvements.

2 Overview: Surety Bio-Event App

The Surety Bio-Event App is our Internet biosurveillance application developed at the University of Texas at Austin for the DTRA Biosurveillance Ecosystem (BSVE) [18] framework. The BSVE provides capabilities for disease prediction and forecasting, similar to the functionality of weather forecasting. The BSVE is a virtual platform with a set of integrated tools and data analytics which support real-time biosurveillance for early warning and course-of-action analysis. The BSVE provides a platform to access a large variety of social media data feeds, a software development kit to create applications (apps), various tools, and the cloud service to host a web-based user interface. Developers build BSVE apps and deploy them to the BSVE to be ultimately used by biosurveillance experts and analysts.

Fig. 1. Overview of the Surety Bio-Event App.

Our Surety Bio-Event app covers the entire cycle of Internet biosurveillance according to previous work [10]. Figure 1 shows a high-level picture of the Surety Bio-Event App. The four steps are: (1) Multi-Source Real-Time Data, which collects data (Section 5); (2) Trust Filter, which processes data into information (Section 3); (3) Surveillance Optimization (including early detection, situational awareness, and prediction), which assembles the information into analysis (Section 6); and (4) Forecasts and Predictions, which propagates the analysis to experts through a Graphical User Interface (Section 4). Furthermore, the Surety app is user customizable and receives Goals and Situational Awareness as well as Historical Data, Detections, and Predictions from biosurveillance experts.

Figure 2 shows a more detailed view of the App. In this paper, we concentrate on the second step, the trust filter, while we broadly review the other steps too. With data collected from social media, the trust filter component of the App evaluates the data sources to find the most trustworthy social media users with respect to a given surveillance goal. The trust filter component optimizes the range, availability, and quality of data using a combination of algorithms measuring six dimensions of trust: Expertise, Experience, Authority, Reputation, Identity, and Proximity.

Fig. 2. Diagram of Data Collection and Analysis with the Surety Bio-Event App (SBEA).

The primary functions of the trust filter component are: (1) improving the quality of data employed by BSVE applications and analysts to make biosurveillance decisions, (2) tracking and quantifying the trustworthiness of known, preferred users to guard against data bias and quality drift for BSVE applications and analysts, and (3) expanding the landscape of possible trusted social media users by offering trusted but previously unexplored users via recommendation notifications to BSVE applications and analysts.

3 Trust Filters

In order to determine user trustworthiness, we introduce the concept of a trust filter—a score between 0 and 1 assigned to a user (e.g., a Twitter user) which rates his/her trustworthiness with respect to a given criterion. We propose six trust filters:

Expertise Expertise measures a user's involvement in the subject of interest [3]. We define Expertise as the probability that a user will generate content on the topic in question (e.g., an influenza outbreak). Using the user's history of posts, Expertise can be calculated as how often a specific user has written about the subject of interest in the past:

Expertise(u_i, t) = p(t | u_i) = #Posts(u_i, t) / #Posts(u_i), where u_i is a user in the social media network, t is a topic, and p(t | u_i) is the probability that the user generates content on that topic. We calculate this probability by counting the number of that user's posts on the topic and dividing by his/her total number of posts. For all the filters, we use a keyword-based classifier to distinguish the posts concerning the topic of interest and the users posting about that topic.
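To make the definition concrete, the following minimal Python sketch computes Expertise for a single user from his/her post history; the keyword set and function names are illustrative assumptions, not the actual classifier used in the App.

import re

# Hypothetical keyword set for the topic of interest (e.g., a flu outbreak).
FLU_KEYWORDS = {"flu", "influenza", "fever"}

def is_on_topic(post, keywords=FLU_KEYWORDS):
    """Keyword-based classification: a post is on-topic if it mentions any keyword."""
    tokens = set(re.findall(r"[a-z']+", post.lower()))
    return bool(tokens & keywords)

def expertise(posts, keywords=FLU_KEYWORDS):
    """Expertise(u_i, t) = #Posts(u_i, t) / #Posts(u_i)."""
    if not posts:
        return 0.0
    return sum(is_on_topic(p, keywords) for p in posts) / len(posts)

# A user with one flu-related post out of four has Expertise 0.25.
print(expertise(["I caught the flu :(", "nice weather", "game tonight", "coffee time"]))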


Experience Experience is the degree to which a user's posts are corroborated by other users. Informally, Experience seeks to measure how well a user's posts about a subject are corroborated by the ground truth. Assuming that the average involvement of all users in the subject of interest reveals the truth about the outside world (e.g., everybody posts about flu when a flu outbreak actually happens), we can use this average to calculate Experience. In order to do so, we measure the difference between a user's involvement in the subject, given by Expertise, and the average Expertise over all users, denoted Expertise(t). Because Expertise is already between 0 and 1, the following yields a score between 0 and 1: Experience(u_i, t) = 1 − |Expertise(t) − Expertise(u_i, t)|. The closer one's Expertise to the average Expertise, the higher his/her Experience score.
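A minimal sketch of this computation, assuming per-user Expertise scores have already been calculated (the user names and values below are made up):

def experience(user_expertise, avg_expertise):
    """Experience(u_i, t) = 1 - |Expertise(t) - Expertise(u_i, t)|, both inputs in [0, 1]."""
    return 1.0 - abs(avg_expertise - user_expertise)

# Users whose Expertise sits near the population average get high Experience.
expertise_by_user = {"alice": 0.04, "bob": 0.60, "carol": 0.02}
avg = sum(expertise_by_user.values()) / len(expertise_by_user)
for user, score in expertise_by_user.items():
    print(user, round(experience(score, avg), 2))  # alice 0.82, bob 0.62, carol 0.80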

Authority Authority is the number and quality of social media links a user receives from Hubs as an Authority [3]. A link is a relationship between users, e.g., likes and comments on Facebook, and following on Twitter. We utilize the Hyperlink-Induced Topic Search (HITS) [11] algorithm, a link analysis algorithm widely used to rank Web pages and other entities connected by links, to get a score between 0 and 1. In this algorithm, certain users, known as Hubs, serve as trustworthy pointers to many other users, known as Authorities. Therefore, Authorities are the users that have been recognized within the social media community.

Reputation Reputation is the number and quality of social media links to a user. We utilize the PageRank algorithm [2], another widely used ranking algorithm, to get a score between 0 and 1.
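Both link-based filters can be sketched with standard graph tooling. The snippet below uses the NetworkX library's hits and pagerank implementations on a tiny hypothetical follower graph; the user names and edges are illustrative, not the paper's data.

import networkx as nx

# Hypothetical follower graph: an edge u -> v means user u follows (links to) user v.
G = nx.DiGraph()
G.add_edges_from([
    ("alice", "cdc_flu"), ("bob", "cdc_flu"), ("carol", "cdc_flu"),
    ("alice", "bob"), ("carol", "bob"),
])

hubs, authorities = nx.hits(G, normalized=True)   # Authority filter: HITS authority scores
reputation = nx.pagerank(G)                       # Reputation filter: PageRank scores

for user in G.nodes:
    print(f"{user}: authority={authorities[user]:.2f} reputation={reputation[user]:.2f}")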

Identity Identity is the degree of familial or social closeness between a user and the person afflicted with the disease. The Identity filter is defined over the relationship between the posting user who talks about the disease and the subject of the post who has somehow encountered the disease. If the user is reporting the disease about himself/herself, the Identity score assigned is the maximum value, which is 1. If the user reports about a close family member, the score is higher than when the user reports about an acquaintance of his/hers. We utilize Natural Language Processing and greedy algorithms to calculate this score. This trust filter first finds all possible grammatical subjects of a sentence (e.g., a Twitter post); then, using the words in the family tree, it finds the closest family relationship to those subjects and reports that family relationship (e.g., self, mother, co-worker, son) for Identity. A score is assigned to this relationship ranging from 1 (i.e., reporting disease about oneself) to 0 (i.e., talking about total strangers). To get the Identity score of a user, the Identity scores of all of his/her posts about the subject of interest are calculated and averaged. More details on this filter can be found in our previous work [13].
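The sketch below illustrates only the scoring idea: instead of the NLP subject extraction described above, it simply scans a post for relation words and averages per-post scores. The word list and the intermediate score values (everything between the endpoints 1 and 0) are assumptions for illustration.

# Hypothetical relation-word scores, from 1 (self) down toward 0 (strangers).
RELATION_SCORES = {
    "i": 1.0, "me": 1.0,
    "mother": 0.5, "father": 0.5, "son": 0.5, "daughter": 0.5,
    "co-worker": 0.25, "friend": 0.25,
}

def identity_of_post(post):
    """Return the closest (highest-scoring) relation found in the post, default 0."""
    words = post.lower().replace(",", " ").split()
    return max((RELATION_SCORES[w] for w in words if w in RELATION_SCORES), default=0.0)

def identity(on_topic_posts):
    """A user's Identity score is the average over his/her on-topic posts."""
    if not on_topic_posts:
        return 0.0
    return sum(identity_of_post(p) for p in on_topic_posts) / len(on_topic_posts)

print(identity(["my son has the flu", "i feel awful with the flu again"]))  # (0.5 + 1.0) / 2 = 0.75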


Proximity Proximity estimates the distance of a user from the event (e.g., the disease outbreak location). Using relationship distance (i.e., the Identity score) and geographical distance (through geo-tagged posts and the geo-location of the user), Proximity utilizes a greedy algorithm to perform graph traversal over the social media network and then combines the Identity value with the distance value to calculate Proximity, as shown in Algorithm 1.

Algorithm 1: Proximity Algorithm

Input:  Directed user graph G
Output: Proximity scores user.proximity

Initialize Identity threshold T
for user in users do
    if user.identity > T then
        user.separation = 1 / user.identity
    else
        user.separation = ∞
    end
end
for user u in G do
    for user v in G − {u} do
        distance = v → u
        u.separation = min(u.separation, v.separation × distance)
    end
end
for user in users do
    user.proximity = 1 − user.separation
end

Note that the network graph that the trust filters use is pruned so that it contains only those users who have posted (at least once) about the subject of interest. As a result, trust filter scores are calculated focusing on the community that discusses a particular subject on social media.
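A direct Python rendering of Algorithm 1 follows, assuming the pruned user graph is given as a dictionary of edge distances and per-user Identity scores are already available; the data structures, the threshold value, and the example network are illustrative assumptions.

import math

def proximity_scores(graph, identity, threshold=0.5):
    """graph: {v: {u: distance}} for edges v -> u; identity: {user: Identity score}."""
    separation = {}
    for user in identity:                       # initialize separation from Identity and T
        if identity[user] > threshold:
            separation[user] = 1.0 / identity[user]
        else:
            separation[user] = math.inf
    for u in identity:                          # relax separation along each edge v -> u
        for v in identity:
            if v == u:
                continue
            distance = graph.get(v, {}).get(u)
            if distance is not None:
                separation[u] = min(separation[u], separation[v] * distance)
    return {user: 1.0 - separation[user] for user in identity}

# Alice reports about herself; Bob is reached through an edge of distance 0.8 from Alice.
print(proximity_scores({"alice": {"bob": 0.8}}, {"alice": 1.0, "bob": 0.1}))  # alice: 0.0, bob: ~0.2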

4 Trust Filter GUI

Figure 3 displays the Graphical User Interface (GUI) of the trust filter tab of the Surety app. The GUI is composed of four smaller windows. On the top left, the social media users are listed, and for each, the value of each of the six trust filters is shown. Next to the gear icons, the names of the six trust filters appear: Identity, Reputation, Experience, Expertise, Authority, and Proximity. The last column is the Combined trust score, currently the average of the six filters.

On the GUI, the analyst or BSVE app developer selects a trust filter. He/she can then sort the users with respect to that score (descending or ascending). The higher the score, the more trustworthy the user with respect to that trust filter.

Fig. 3. Trust Filter GUI of the Surety Bio-Event App.

In Figure 3, the users are sorted based on Proximity in descending order. The analyst or BSVE app developer can also select favorite users that he/she has found trustworthy over time and mark them with a star. The GUI suggests social media users that have a higher combined score than the favorite users (trusted but previously unexplored users) with a blue glow under the user name, as shown in the Figure. The analyst can also review the favorite users (bringing all the favorites to the top).

On the GUI, the Network Graph is the top right window, which displays the users on social media as nodes, along with their links (e.g., following on Twitter) and node sizes. The analyst can select a trust filter to size the nodes in the Network Graph. In this Figure, the node sizes are based on Identity.

On the bottom left of the GUI, under Node Histogram, the GUI charts the trust filter scores of the top five users for the selected filter.

On the bottom right, under Trust Score Distribution, the GUI displays the range of user trustworthiness, based on each filter and the combined score. The distribution of user trust scores, with tunable granularity (set to 0.1 in this Figure), shows the number of social media users that have a given trust score.


5 Data Collection

In this section and the next, we briefly overview the first and third steps of the biosurveillance process, namely data collection and optimization, for the sake of completeness.

The Surety app (1) uses data already available on the BSVE and (2) collects data and uploads it to the BSVE. The data sources monitored within the BSVE include well-established and trusted data providers such as the Centers for Disease Control (CDC) and the World Health Organization (WHO). Data from these sources give the analyst working with the BSVE the best possible measure of the state of disease within the country. In addition, the BSVE collects data from news sources and Twitter. Among the sources the BSVE already provides, Twitter contains a treasure trove of information. However, other sources such as blogs, Instagram, and Reddit have been underused. The Surety app aims to fill these gaps in data collection. The trust filter part of the Surety App seeks to collect data from other sources not currently supported by the BSVE that contain connectivity network information and are typically focused on individuals as opposed to news feeds.

Figure 4 shows some of the data sources for the Surety app. Note that not all the data sources are candidates to be used with trust filters. Some of these data sources provide only time-series data, which is used by the optimization part. The data sources that are appropriate for trust filters are those for which we have implemented methods within our API to collect historical user data as well as connections to streaming APIs: Twitter, WordPress, Instagram, Tumblr, Reddit, and Wikipedia.

6 Optimization

The third step of the biosurveillance process analyzes large collections of trusted data sources to assemble systems that efficiently achieve user-specified surveillance goals, such as early outbreak detection. This analysis is accomplished through optimization algorithms that evaluate data collections by comparison to historical and simulated bio-events. The Surety app yields trusted data sources, along with statistical models and performance metrics, to support future surveillance activities. The trust filter part of the Surety App is capable of collecting a wide range of data and then formatting that data into the time-series data source required by the optimization part. Our optimization algorithms, discussed elsewhere, include early detection, situational awareness, and prediction [14].

7 Implementation

Our app is implemented with a Python Flask back-end and a JavaScript front-end. The back-end supports user interactivity in the front end: it serves JSON data generated by the algorithms to the user interface. The application is integrated into the BSVE.
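As a minimal sketch of this pattern, the Flask service below exposes one JSON endpoint that a JavaScript front end could poll; the route name and payload shape are assumptions for illustration, not the actual BSVE interface.

from flask import Flask, jsonify

app = Flask(__name__)

# In the real app these scores would come from the trust filter algorithms.
TRUST_SCORES = [
    {"user": "alice", "expertise": 0.25, "identity": 0.75, "combined": 0.41},
    {"user": "bob", "expertise": 0.05, "identity": 0.50, "combined": 0.22},
]

@app.route("/api/trust-scores")
def trust_scores():
    """Serve trust filter scores as JSON to the front-end GUI."""
    return jsonify(TRUST_SCORES)

if __name__ == "__main__":
    app.run(port=5000)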


Fig. 4. Data Collection Sources of the Surety App.

8 Experiments

We have designed a preliminary set of experiments to answer the following research question: How well do the proposed filters perform? In order to answer this question, we plan to use seed data (e.g., a synthetic network of users, posts, and disease outbreaks) as well as actual data (e.g., an actual network of Twitter users and their posts).

1. We observe the values of the trust filters and their trends.
2. We compare filter scores against hospital data to judge the ability of the trust filters to detect disease outbreaks.

In this paper, we observe the trend of the proposed trust filters for a real network of 2,000 Twitter users with their posts. The use of seed data, as well as the comparison with hospital data, is work in progress.

For this set of experiments, we downloaded the posts and geo-locations of 2,000 Twitter users. In order to do so, we performed a keyword search for the word 'flu' on the Twitter API and then downloaded the user profile information (including geo-location coordinates), the users' friends' timelines, lists of friends and followers, and the past 30 days of tweets. We started the download on July 22, 2016 and, because of Twitter's bandwidth limitations, it took us a week to download 2,000 users who had posted at least once with the word 'flu', totaling 33 GB. Note that not all the posts of these users over the past 30 days are necessarily about flu. We use a keyword-based classifier to distinguish flu-related posts.
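The collection step described above might look roughly like the following, assuming the Tweepy library and Twitter API v1.1 credentials; the method names follow Tweepy 4.x, and current Twitter/X API tiers may no longer expose the same endpoints, so this is a sketch rather than the authors' actual crawler.

import tweepy

auth = tweepy.OAuth1UserHandler("CONSUMER_KEY", "CONSUMER_SECRET",
                                "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)  # back off when rate limits are hit

collected = {}
for tweet in api.search_tweets(q="flu", count=100):          # keyword search for 'flu'
    user = tweet.user
    if user.id in collected:
        continue
    collected[user.id] = {
        "profile": user._json,                                # includes geo/location fields
        "timeline": [t._json for t in api.user_timeline(user_id=user.id, count=200)],
        "friend_ids": api.get_friend_ids(user_id=user.id),
        "follower_ids": api.get_follower_ids(user_id=user.id),
    }
print(f"collected {len(collected)} users")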

Figure 5 shows the filters' maximum, minimum, and average values. The Identity trust filter has an average (as well as peak) value at about 0.48, which means that, when people do post about flu, they tend to post about flu encounters of their nuclear family members, as 0.5 is assigned to nuclear family members for the Identity score. Reputation and Authority scores are uniformly close to 0, implying that the network we downloaded had very little connectivity. The low degree of connectivity is expected since people who post about flu do not necessarily tend to follow others who post about flu. The average value of Expertise was also close to 0, meaning that even among those who have posted about flu at least once, the number of flu-related posts over a 30-day period was very low. The average value of 0.95 for Experience shows that most users' Expertise scores were close to the average Expertise, i.e., close to 0. Investigating the outliers should point to users who were unusually concerned about flu. Finally, we found that Proximity should be redefined to make it independent of Identity and to show concrete distance from outbreak locations.

Figures 6, 7, 8, and 9 display the most interesting correlations we found between the filter values. Figure 6 shows that the combined score is most heavily influenced by Identity; these two filters are related with R² equal to 0.49. Therefore, we might need to normalize and weight the filters to obtain a new, less biased definition of the Combined score.
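For reference, R² values like these can be computed as the squared Pearson correlation between two filter score vectors; the per-user numbers below are made up for illustration.

import numpy as np

# Hypothetical per-user scores for two filters.
identity = np.array([0.50, 1.00, 0.25, 0.50, 0.00, 0.75])
combined = np.array([0.42, 0.61, 0.30, 0.44, 0.15, 0.52])

r = np.corrcoef(identity, combined)[0, 1]   # Pearson correlation coefficient
print(f"R^2 = {r ** 2:.2f}")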

Fig. 5. Statistics of Trust Filters.


Fig. 6. Correlation between Combined Filter and Identity.

Fig. 7. Correlation between Reputation and Authority.


Fig. 8. Correlation between Expertise and Experience.

Figure 7 charts the correlation between the Reputation and Authority filters (R² = 0.15). These two filters are not closely related. Therefore, while both measure the connectivity of the network, they consider different aspects of connectivity.

Figure 8 confirms that Experience and Expertise are inversely correlated. We might need to update the definition of Experience to measure the corroboration by others differently.

Finally, while Proximity is initialized with Identity, as Figure 9 shows, it is rather independent of Identity. While the Proximity of users to a potential outbreak location can be compared to one another, the absolute value of Proximity still does not show the concrete physical distance between the user and a flu outbreak location.


Fig. 9. Correlation between Identity and Proximity.

8.1 Feature Importance

We compare our trust filters with other simple features which are widely studied in processing Twitter data [16]. Figure 10 and Table 1 show the feature importance scores from Scikit-Learn [17]. We use the Extremely Randomized Trees classifier as our method to evaluate the importance of each feature, utilizing a library [1, 15] in which the Gini coefficient measures the importance of each feature. In short, the importance scores sum to one, and the larger the score, the more important the feature is to the decision. As Table 1 shows, the best feature from the Extremely Randomized Trees classifier is the number of posts by a specific user within the given period of time. Consequently, the filters that are based on the number of related posts, such as Experience and Expertise, work well. However, the number of posts can be easily forged with posting robots or spam posts. Two other features that are known to perform well in similar types of problems are the average post length and the number of tagged Twitter IDs, which start with the symbol @ [4]. Therefore, potential filters to consider could be based on these features. Identity, Reputation, and Proximity all perform better than the other features studied in previous work, including retweets and whether or not the posts contain '?' or '!'. Finally, Authority performs poorly and can be considered irrelevant.
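The importance scores in Table 1 come from an Extremely Randomized Trees model; a minimal Scikit-Learn sketch of the procedure is shown below on synthetic data, since the actual feature matrix and labels are not reproduced here.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in: one row per user, twelve features ordered as in Table 1,
# and a binary label (e.g., whether the user's posts were judged relevant).
rng = np.random.default_rng(0)
X = rng.random((200, 12))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 200) > 1.0).astype(int)

clf = ExtraTreesClassifier(n_estimators=250, random_state=0)
clf.fit(X, y)

# Gini-based importances sum to 1; larger values mean the feature drives more splits.
for idx in np.argsort(clf.feature_importances_)[::-1][:5]:
    print(f"feature {idx}: importance {clf.feature_importances_[idx]:.3f}")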


Fig. 10. Feature Importance.

Table 1. Features and Corresponding Importance Scores.

Feature              Importance Score

Number of Posts      0.205
Experience           0.143
Expertise            0.132
Avg. Post Length     0.129
Number of @ Tags     0.111
Identity             0.100
Reputation           0.099
Proximity            0.033
Retweet              0.029
Contains '?'         0.010
Contains '!'         0.009
Authority            0.002

9 Conclusion

Filtering and ranking social media posts is essential to biosurveillance applications that monitor them to detect and forecast disease outbreaks. We introduced a novel way to filter and rank social media posts by concentrating on the trustworthiness of social media users with respect to a given subject. We proposed six trust filters and used them in the context of a complete biosurveillance application. We further evaluated these trust filters by observing how they perform on a real set of Twitter posts downloaded from 2,000 users over a 30-day period. Improving the filter definitions and judging the effectiveness of the filters in finding actual disease outbreaks are two major directions for future work.

10 Acknowledgment

The Surety Bio-Event App is a long-term project of the Center for Identity. The authors thank Guangyu Lin, Roger A. Maloney, Ethan Baer, Nolan Corcoran, Benjamin L. Cook, Neal Ormsbee, Haowei Sun, Zeynep Ertem, Kai Liu, and Lauren A. Meyers for their contributions to this project. This work has been funded by the Defense Threat Reduction Agency (DTRA) under contract HDTRA1-14-C-0114 CB10002.

References

1. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Statistics/Probability Series. Wadsworth Publishing Company, Belmont, California, U.S.A., 1984.

2. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.

3. S. Budalakoti and K. S. Barber. Authority vs affinity: Modeling user intent in expert finding. In Social Computing (SocialCom), 2010 IEEE Second International Conference on, pages 371–378. IEEE, 2010.

4. C. Castillo, M. Mendoza, and B. Poblete. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 675–684, New York, NY, USA, 2011. ACM.

5. N. Collier, N. T. Son, and N. M. Nguyen. OMG U got flu? Analysis of shared health messages for bio-surveillance. Journal of Biomedical Semantics, 2(5):S9, 2011.

6. K. Denecke, M. Krieck, L. Otrusina, P. Smrz, P. Dolog, W. Nejdl, E. Velasco, et al. How to exploit Twitter for public health monitoring. Methods Inf Med, 52(4):326–39, 2013.

7. E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Epidemic intelligence for the crowd, by the crowd. The International AAAI Conference on Web and Social Media, 12:439–442, 2012.

8. S. Doan, L. Ohno-Machado, and N. Collier. Enhancing Twitter data analysis with simple semantic filtering: Example in tracking influenza-like illnesses. In Healthcare Informatics, Imaging and Systems Biology (HISB), IEEE Second International Conference on, pages 62–71, 2012.

9. S. Doan, B.-K. H. Vo, and N. Collier. An analysis of Twitter messages in the 2011 Tohoku earthquake. In International Conference on Electronic Healthcare, pages 58–66. Springer, 2011.

10. D. M. Hartley, N. P. Nelson, R. Arthur, P. Barboza, N. Collier, N. Lightfoot, J. Linge, E. Goot, A. Mawudeku, L. Madoff, et al. An overview of internet biosurveillance. Clinical Microbiology and Infection, 19(11):1006–1013, 2013.

11. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

12. A. Lamb, M. J. Paul, and M. Dredze. Separating fact from fear: Tracking flu infections on Twitter. In HLT-NAACL, pages 789–795, 2013.


13. G. Lin, R. Nokhbeh Zaeem, H. Sun, and K. S. Barber. Trust filter for disease surveillance: Identity. In IEEE Intelligent Systems Conference, pages 1059–1066, Sep 2017.

14. K. Liu, R. Srinivasan, Z. Ertem, and L. Meyers. Optimizing early detection of emerging outbreaks. Poster presented at: Epidemics6, Sitges, Spain, Nov 2017.

15. G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts. Understanding variable importances in forests of randomized trees. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS'13, pages 431–439, USA, 2013. Curran Associates Inc.

16. J. O'Donovan, B. Kang, G. Meyer, T. Höllerer, and S. Adalı. Credibility in context: An analysis of feature distributions in Twitter. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pages 293–301, Sept 2012.

17. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, Nov. 2011.

18. Digital Infuzion. DTRA Biosurveillance Ecosystem (BSVE), 2017.

19. M. Torii, L. Yin, T. Nguyen, C. T. Mazumdar, H. Liu, D. M. Hartley, and N. P. Nelson. An exploratory study of a text classification framework for internet-based surveillance of emerging epidemics. International Journal of Medical Informatics, 80(1):56–66, 2011.


WWW.IDENTITY.UTEXAS.EDU

Copyright ©2019 The University of Texas. Confidential and Proprietary, All Rights Reserved.