

University Depression Rankings Using Twitter Data

Dept. of CIS - Senior Design 2014-2015∗

∗Advisor: H. Andrew Schwartz ([email protected]) and Chris Callison-Burch ([email protected]).

Ashwin Baweja ([email protected]), Univ. of Pennsylvania, Philadelphia, PA

Jason Kong ([email protected]), Univ. of Pennsylvania, Philadelphia, PA

Tommy Pan Fang ([email protected]), Univ. of Pennsylvania, Philadelphia, PA

Yaou Wang ([email protected]), Univ. of Pennsylvania, Philadelphia, PA

ABSTRACT

With the rise of social media, university rankings are playing an increasingly influential role in the selection process for prospective university students. Simultaneously, mental health has risen to the forefront of discussions among universities nationwide, in light of calls for increased mental illness awareness. Previous attempts at formulating rankings of schools' happiness and mental illness have centered around paper or electronic surveys taken by only a small fraction of the student body. Our work posits a new methodology for constructing college depression rankings through analysis of the language used by students on social media platforms. By using a corpus of 78 million Tweets generated from September 2014 to March 2015 and leveraging existing research into depression language analysis, we produce a set of meaningful rankings comparing depression among schools.

1. INTRODUCTION

In this paper, we propose a novel approach to ranking universities along the dimension of depression. Such rankings not only influence the prestige of a university but also the decisions of high school students when determining where to spend the next four years of their life. However, current methodologies for computing depression rankings are neither robust nor scalable.

Our approach leverages social media data and existing depression models to produce rankings with significant improvements in robustness and scalability. The resulting rankings provide key insights into correlations between depression at universities and characteristics of those universities, such as the prevalence of a pre-professional culture.

Finally, although our rankings are specific to depression, our approach can be generalized and applied to other areas as well. For example, our method could be used to rank universities on the dimension of health. Alternatively, the model could be refined to make the rankings more fine-grained (e.g., at the student group level) or more coarse (e.g., at the region level).

2. BACKGROUND

For universities, rankings play an important role in influencing prestige, endowments, administrative decisions, and, perhaps most importantly, the college choices of prospective students in high school. One set of rankings rising in importance is "happiness" rankings: listings of universities based on the purported happiness of their students. With mental health awareness rising to the forefront of national attention through frequent and high-profile suicides, students and administrators are pushing for more effort in monitoring mental health and depression levels at their respective universities. Accompanying the additional importance placed on college rankings has been an increased number of rankings published by today's media. Joining established and recognized publications such as The U.S. News and World Report [8] and The Princeton Review in creating rankings are up-and-coming viral media websites such as BuzzFeed and The Huffington Post.

Rather surprisingly, the methodology used by these publishers to construct such "happiness" rankings has not kept up with the swell in technology that has led to their greater prominence. According to writers at The Princeton Review, their methodology for constructing their annual set of rankings comprises distributing an 83-question survey to university students through a physical booth and through e-mail [10]. The questions are all multiple choice, with answers on a scale of 1 to 5, where 1 means strongly disagree and 5 strongly agree. On average, fewer than four hundred students at each university take the survey, and official surveying is completed only once every three years.

The lack of granularity in the data is troubling when considering the amount of emphasis placed on their findings. For example, The Princeton Review's recent "Happiest Colleges" ranking appeared in headlines on numerous high-traffic websites such as The Huffington Post, College Atlas, and The University Herald. Upon deeper investigation into their methodology, it was found that the rankings were calculated solely by averaging the answers of students at each university to the question "How happy are you?" [10]

At the same time, the conclusions of these emotional health ranking reports have great influence on their readers. High school students and families refer to college rankings as an important source of information during the college decision process. College students find rankings a helpful tool for understanding outward perception of their university. Administrators look to rankings to evaluate their performance with regard to student mental health and to formulate policy decisions that manage their reputation. Given the importance placed on these rankings, a mismatch exists between current rankings' accuracy and the decisions made by consulting them.

3. RELATED WORK

There is existing academic research centered on using natural language models to predict depression in the general population. However, only a small subset of these studies focuses on college students. Below, we highlight previous work that is relevant to our study.

As early as 2004, Rude et al. [11] conducted a study that looked at the language used by depressed and depression-prone college students. The paper took a linguistic approach, analyzing the diction used in essays written by depressed college students. This study was the first to establish that there is a significant difference in language use between depressed and non-depressed college students.

In 2006, Stephenson et al. [14] published a study that examines predictors of suicidal ideation among college students. This study primarily focused on contrasting indicators of suicide between male and female college students. Unlike our work, Stephenson's study focuses on suicidal ideation rather than depression. However, many of the indicators of suicide proposed by Stephenson et al. indicate depression as well, making the study relevant to our work.

In later years, several papers used the same linguistic approach taken by Rude et al. to analyze depression and its symptoms. However, most of these studies lacked a demographic focus, choosing instead to analyze depression patients across the entire population. A paper by Neuman et al. [7] used data from the Internet, as well as the expertise of linguistic scholars, to construct a predictive model that identifies depression based on a piece of writing. The model was able to achieve an 84% classification accuracy. A more recent study by Howes, Purver, and McCabe [5] used linguistic indicators to track depression patients through an online text-based therapy. They found that linguistic models could predict important measures of depression with a high degree of accuracy.

In recent years, research has applied linguistic models to social media data to predict depression. A study by De Choudhury et al. [3] used Twitter data to predict which patients were depressed before they even received a formal diagnosis. The study used the Twitter data of diagnosed depression patients from the year prior to their diagnosis to test whether a diagnosis of depression could be gauged at the individual level. The model achieved a predictive accuracy of 72%. A recent study by Schwartz et al. [12] refined this predictive approach by predicting a depression score on a continuous scale rather than simply producing a classification of depressed versus not depressed.

Several recent studies have used social media data to analyze depression for only a subset of the population. One study, conducted by Thompson, Bryan, and Poulin [15], used social media posts to predict suicide risk in military personnel and veterans. Another study, conducted by Moreno et al., studied disclosures of depression on Facebook by already-diagnosed college students. However, this study did not focus on building a model of depression for college students. Instead, it studied how often students who are diagnosed with depression display negative emotions on Facebook.

Additionally, this work considers using PERMA scores, a positive psychology methodology, as an alternative means of validating the depression approach. PERMA is a scheme developed by Dr. Martin Seligman to capture emotional well-being along five dimensions [13]. For each dimension, a corpus of keywords (as well as their relative weights) was developed at the University of Pennsylvania. PERMA scores are generated by aggregating the normalized frequencies of these keywords for a given input text. Specifically, the output has five dimensions, with each dimension having two directions (positive and negative), and a score for each direction of each dimension.¹ The model for calculating these scores is well established [13].

4. SYSTEM MODEL

Figure 1: Block Diagram of Full Model

Our approach, outlined in Figure 1, leverages language features from Tweets and the World Well-Being Project's (WWBP)² existing depression model to construct university depression rankings. However, this approach can be generalized to work for a broader set of applications. In particular, these rankings can be produced over any set of groups, not just universities. Additionally, alternative language models can be substituted for WWBP's model to provide more flexibility in how depression is measured.

¹PERMA stands for Positive Emotion, Engagement, Relationships, Meaning, and Accomplishment. Each of these dimensions has a positive direction and a negative direction. Hence, there are a total of 10 scores, one for each direction of each dimension.

²The World Well-Being Project is a collaboration among computer scientists and psychologists at the University of Pennsylvania, aimed at studying psychology through modern machine learning techniques. For more details please refer to http://wwbp.org/.

This more generalized approach has two main components: a language model and a data set of user-level social media messages. These two components are combined to produce user-level scores per the language model, and these scores are finally aggregated to produce the final rankings. We now describe each of these four components/stages in more detail, particularly for our attribute of interest, depression.
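
To make the composition of these stages concrete, the following is a minimal Python sketch of the generalized pipeline. All names here (User, build_rankings, train_model, aggregate) are our own illustration, not part of any existing library:

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: str
    group: str                       # e.g., the university the user is mapped to
    text: str                        # concatenation of the user's messages
    dep_score: float | None = None   # label on the chosen depression scale

def build_rankings(train_users, rank_users, train_model, aggregate):
    """Compose the stages: train a depression model on labeled users,
    score each user in the ranking data set, then aggregate per group."""
    model = train_model(train_users)          # stage: depression model
    for u in rank_users:                      # stage: user-level scores
        u.dep_score = model.predict(u.text)
    groups = {}
    for u in rank_users:
        groups.setdefault(u.group, []).append(u)
    scores = {g: aggregate(us) for g, us in groups.items()}  # stage: ranking output
    # Highest aggregate depression score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```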

4.1 Prerequisite: A Depression Scale

The overall model first decides on a numerical scale for depression. This scale is used to label users in the training data set on their levels of depression. It is also used by the depression model to output depression scores for each user.

4.2 Depression Model

The depression rankings produced in this work use WWBP's depression model, which produces depression scores given social media data. However, our approach is not restricted to this model; other existing individual-level depression models can be substituted for it.

The overall requirements for such a model are rather loose. The model must take as training input a set of users, where each user has some text and a depression score. The model must then be able to take text from a test set of users and produce a depression score for each of them. This output score should be on the same scale as the depression scores in the training data.
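
Stated as an interface, these loose requirements amount to something like the following sketch (our formulation; WWBP's actual library API differs):

```python
from typing import Protocol, Sequence

class DepressionModel(Protocol):
    """Any model satisfying this contract can be plugged into the pipeline."""

    def fit(self, texts: Sequence[str], scores: Sequence[float]) -> None:
        """Train on users' text and their labeled depression scores."""
        ...

    def predict(self, text: str) -> float:
        """Return a depression score on the same scale as the training labels."""
        ...
```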

4.3 Data Set Construction

Our work aims to leverage social media language to generate depression rankings. That said, the data set is not required to consist strictly of social media messages: any written text or combination of written texts will do. For example, for each user, one may choose to use a combination of Facebook status updates and college application essays. The only requirement is that the text in the training and ranking data sets (described below) be similar.

As with any statistical model, the quality of the results depends on whether the data set is large enough to produce statistically significant results. However, the size of the data set required to produce meaningful rankings will depend on the depression model chosen in the system implementation. Thus, this system model leaves it to the user to ensure that the data set is sufficiently large for the chosen depression model.

The required data set can be broken down into two parts. First, the model requires a training data set that will be used to train the depression model. Here, we require that the training data consist of a sufficient number of users to train the depression model. Then, for each user, the data set should contain sufficient text as well as a depression label on the scale described above.

Second, we need a ranking data set that will be used to produce the final rankings. For each group to be ranked, this data set should contain a sufficient number of users labeled as belonging to that group. Then, for each user, the data set should contain a sufficient amount of text to produce a depression score for that user.

4.4 User-Level Scores

In the third stage, we compute the depression scores for each individual user. This is done by training the chosen depression model on the training data described previously. Then, for each user in the ranking data set described above, the text for that user is simply run through the depression model to compute a depression score for that user.

4.5 Depression Ranking Output

In the final stage, we compute rankings for our groups using the user-level depression scores. This simply requires an aggregation function that takes as input the depression score of each user in a group and returns a single depression score for that group. The simplest such function takes the average of all the depression scores, but a more sophisticated approach that weights users by the total number of words in their text, or by other qualities, may also be used. The final output is a depression score for each group, and the groups can then be sorted on this score to produce the final rankings.

5. SYSTEM IMPLEMENTATION

Below, we discuss how each stage of the model described above is implemented.

5.1 Depression Model

The method in the baseline depression model developed by WWBP is called differential language analysis. Differential language analysis involves first working with a set of labeled training data, choosing a set of features that best predict the labels (e.g., n-grams, topics), and then fitting a corresponding model (e.g., Naive Bayes, regression, SVM). The resulting weightings can then be applied to a novel data set in three steps: sanitizing and converting the data set into a message table, extracting the desired features from this message table, and then performing correlation analysis and visualization. This is our overarching approach to constructing a predictive model based on social media language.

WWBP Library

We base our work on a low-level implementation needed to run and tune the WWBP model. The library is a machine learning library with an interface through which a range of tasks can be accomplished, including model creation (spanning a range of regression and classification models), feature extraction, and data visualization. The library uses a MySQL database to store the necessary data and Python code to complete the machine learning tasks. The parameters for each model are stored in memory as local variables in the Python methods, so a .pickle file must be created for each model that needs to be accessed later. This library provides a convenient interface with a relatively fast run time.
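
The persistence pattern described above amounts to pickling each trained model for later reuse; a minimal sketch (illustrative names, not the library's actual API):

```python
import pickle

def save_model(model, path):
    # Persist the trained model so later ranking runs can reuse it
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g., save_model(dep_model, "depression_model.pickle")
```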

WWBP Depression Model

This work uses the model developed by Schwartz et al. [12] as part of the World Well-Being Project as the baseline model for predicting individual levels of depression. The model is trained and tested on data from 28,749 Facebook users who opted into a study in which they completed a personality questionnaire and provided access to their status updates between June 2009 and March 2011. The questionnaire measures levels of depression in seven different facets, based on a methodology developed by Cambridge University [12]. The survey averages all seven facets to output a depression score, termed "degree of depression," that ranges from 0 to 12. We use this "degree of depression" score as the depression metric for this work.

Schwartz et al.'s [12] model uses the following features to output the aggregate depression score:

1. 1- to 3-grams: The relative frequency of n-grams, restricted to those used by at least 5% of all users

2. Topics: 2000 topics derived via latent Dirichlet allocation (LDA) on the Facebook data, in addition to 64 Linguistic Inquiry and Word Count (LIWC) categories [9]

3. Number of words: The total number of words a user has posted

The model first applies principal component analysis (PCA) to reduce its feature space and then uses an L2-penalized regression model to predict the depression score. The model provides a more nuanced prediction of depression (having the outcome as a scale rather than just a binary output) while still maintaining decent accuracy. It achieved a Pearson r value of 0.39 and a mean squared error of 0.78 on its out-of-sample test set, which significantly outperformed the baseline of sentiment analysis.
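
A minimal scikit-learn sketch of this two-step structure (our illustration, not WWBP's code; the synthetic data and n_components=100 are stand-in assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

# Synthetic stand-ins: rows are users, columns are dense n-gram/topic
# features; y is the 0-12 "degree of depression" label.
rng = np.random.default_rng(0)
X, y = rng.random((500, 200)), rng.uniform(0, 12, 500)

# PCA reduces the feature space; Ridge is an L2-penalized regression.
model = make_pipeline(PCA(n_components=100), Ridge(alpha=1.0))
model.fit(X, y)
predicted_scores = model.predict(X)
```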

Figures 2 and 3 are visualizations of the data set used to construct the regression model. The word clouds are generated by computing the correlation of each feature with the labeled depression scores and then emitting the unigrams and bigrams with the highest absolute values. Interestingly, the words in the data set that correlate most strongly with depression tend to be in the first person (e.g., "I," "myself"), whereas those most negatively correlated with depression tend to be activity-related or to refer to a collective entity (e.g., "our," "team," "game").

Figure 2: Words Most Negatively Correlated with Depression in the WWBP Model

Figure 3: Words Most Positively Correlated with Depression in the WWBP Model

Message and User Table Conversion

Assuming a data set of social media messages, the first step in the model training implementation is to convert the raw messages into a well-formatted table. This requires sanitizing the data for unsupported languages as well as labeling features such as links and re-Tweets. Ultimately, the well-formatted table contains the text of messages as well as supporting metadata, such as the user id, the timestamp of the message, and the geographical location of the post, if available. We also create another table that contains user information (e.g., user id, bio) as well as a label for the school to which each user was mapped.
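
A sketch of this conversion step for a single Tweet (the column names and URL normalization are our illustration; the field names follow the Twitter REST API's tweet objects):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def to_message_row(raw):
    """Map one raw Tweet (a dict from the Twitter API) to a row for the
    message table, labeling links and re-Tweets along the way."""
    text = raw["text"]
    return {
        "message_id": raw["id"],
        "user_id": raw["user"]["id"],
        "created_at": raw["created_at"],
        "coordinates": raw.get("coordinates"),   # geo info, if available
        "is_retweet": text.startswith("RT @"),
        "has_link": bool(URL_RE.search(text)),
        "message": URL_RE.sub("<URL>", text),    # normalize links in the text
    }
```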

Feature Extraction

To extract features from our messages, we tokenize the text. The tokenizer has been modified to recognize emoticons common in social media text (e.g., "<3", ":-)"). From the tokenized text, we then create n-grams (sequences of one, two, or three words), which allow greater context than a simple bag-of-words model. We also use lexical and topical features to find language characterizing depression. For lexica, we use LIWC (Linguistic Inquiry and Word Count) lists. Each LIWC list of words is associated with a semantic or syntactic category, such as "engagement" or "leisure." For topics, we use clusters of lexico-semantically related words derived via latent Dirichlet allocation (LDA) [12].
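
A minimal sketch of emoticon-aware tokenization and n-gram generation (the regular expression is a simplified stand-in for the modified tokenizer described above):

```python
import re

# Keeps emoticons like "<3", ":-)", ":D" as single tokens before
# falling back to word characters.
TOKEN_RE = re.compile(r"<3|:-?[)(DP]|\w+", re.UNICODE)

def ngrams(text, n_max=3):
    """Yield all 1- to n_max-grams from a message."""
    tokens = TOKEN_RE.findall(text.lower())
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# e.g., list(ngrams("so tired of finals :-("))
```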

In addition, we refine our features. We use a pointwise mutual information (PMI) criterion, which compares the rate at which two words actually occur together to the rate at which they would be expected to co-occur by chance; 2-grams and 3-grams not meeting the criterion are discarded. We also restrict our words and phrases to those used by at least 5% of the sample. While longer phrases could be considered, computation becomes increasingly challenging because the number of combinations grows exponentially with n-gram size. Words and phrases are normalized by the total number of words written by the user and are transformed using the Anscombe transformation to stabilize variance.
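
Sketches of these two refinements under their standard definitions (our formulation; the count arguments are hypothetical inputs from the feature tables):

```python
from math import log, sqrt

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log of the observed co-occurrence
    rate over the rate expected if the two words were independent."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return log(p_xy / (p_x * p_y))

def anscombe(count, total_words):
    """Normalize a phrase count by the user's total words, then apply
    the Anscombe transform 2*sqrt(x + 3/8) to stabilize variance."""
    p = count / total_words
    return 2 * sqrt(p + 3 / 8)
```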

Correlation Analysis

After feature extraction, we run a correlation analysis between our features and depression scores on the training set. We use an ordinary least squares linear regression over standardized variables, producing a linear function and a Pearson r value, as well as a set of weightings for each feature. These weightings can then be applied to the features extracted from the message table, aggregated on a per-user basis, to obtain a degree of depression score for each user. This degree of depression score is then aggregated at the university level to generate the rankings, as discussed below.

5.2 Data Set Construction

Next, we construct our data set. For each university we wish to rank, our data set contains a set of Tweets posted by Twitter users from that university.

Approach Overview

Unlike Facebook profiles, which track many details about users such as age, university affiliation, and work history, Twitter profiles are very simple. The sign-up process consists of merely entering one's name and email; users can later upload a profile picture and write a very short (160 characters max) bio about themselves. Without any explicit age or university labels on Twitter accounts, drawing conclusions about which users are from a given university is difficult.

For our approach, we aim to construct a data set with high precision. In other words, of the Tweets that we find for each school, we want a very high percentage to actually have been posted by Twitter users at that school.

To construct the data set, we make use of two observations. First, most colleges across America have a Twitter account, and many students who attend a university and have Twitter accounts follow the college's account. Second, although Twitter doesn't directly store university affiliation for each user, many student users choose to list their university affiliation in their bio.

Approach Details

As per the description of the data set construction approach outlined in the "System Model" section, we take a four-step approach to constructing our data set.

In the first step, we manually (through a Google search) find the main Twitter account for each university in our data set. While there are many Twitter accounts containing a university's name, most of which are controlled by third parties not affiliated with the university, we track only verified accounts. Verification of such accounts is done by Twitter and "establish[es] authenticity of key individuals and brands on Twitter" [2]. Because such accounts are verified, they are actually affiliated with the university; they are also more easily found by users in searches and have a larger number of followers.

Second, for each university Twitter account found in the previous step, we use the Twitter API [2] to find all Twitter users with public accounts who follow the university account. The reasoning behind this step follows from the observation that students at universities usually follow their school's Twitter account. Then, for each of these Twitter accounts, again using the Twitter API [2], we pull the account's bio information (if it exists).

Now, although we have information for all Twitter users who follow the school Twitter account, it is very unlikely that all of these users are actually students at the university. In fact, many of these users may be prospective students, fans of the school's sports teams, or faculty at the school. Thus, in the third step, we use the gathered bios to filter the Twitter users down to only those who attend the university. To do so, we use a regular expression that searches for two components in the Twitter bio. First, we look for some affiliation with the university, matching either the full school name (e.g., University of Michigan) or a well-known abbreviation for the school (e.g., umich). All such searches are case insensitive. However, looking for just a university affiliation is not sufficient: alumni, parents of students, faculty, and even sports fans may list the university name in their profile. To filter these out, we also look for a typical graduation year for 4-year students at the university (i.e., a year between 2015 and 2018) or the keyword "student." An example of such a Twitter bio would be: "UMich, Class of 2015".
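
A sketch of such a filter for one school (the specific patterns are illustrative, not our exact production regexes):

```python
import re

def is_student_bio(bio, school_names=("university of michigan", "umich")):
    """Return True if a Twitter bio shows both a school affiliation and a
    current-student signal (a class year in 2015-2018 or the word 'student')."""
    bio = bio.lower()
    affiliated = any(name in bio for name in school_names)
    current = bool(re.search(r"\b(201[5-8]|student)\b", bio))
    return affiliated and current

# e.g., is_student_bio("UMich, Class of 2015")  ->  True
```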

Finally, for the Twitter users found in the step above, we use the Twitter API to pull all Tweets made during the 2014-2015 school year, which we define as Tweets dated after August 31st, 2014.

5.3 Depression Ranking Output

From the first two stages of the model, we are left with a degree of depression score for each user in our data set, the number of words the user has Tweeted (since August 31st, 2014), and a label for the university the user attends. To generate our desired set of comparative rankings for universities, we need a methodology for aggregating user scores to the university level. We choose to aggregate user degree of depression scores using a weighted average based on the number of words Tweeted: users who have Tweeted more words should be given greater weight, as their degree of depression score is less volatile given the amount of data behind it. When deciding among weighting schemes, we considered logarithmic, square root, and linear scales for the number of words. We select a linear scale because the model is considerably more volatile for users with little Tweet data, so we want to be conservative when weighting users with little data backing their score. Furthermore, we set a ceiling of 500 words, at which the weighting stops increasing, because previous research [12] indicates that above this point degree of depression scores are relatively stable, and we did not want to overweight users who Tweet excessively. We decided on this weighting scheme, rather than considering only Twitter users who Tweeted above a certain threshold, because we believe that even users who seldom Tweet provide an indication of overall school well-being, and we do not want to arbitrarily filter down our data. Thus, the formula used for aggregating user scores to the university level is:

$$\mathrm{DepScore} = \sum_{user \,\in\, university} score_{user} \times W_{user}, \quad \text{where } W_{user} = \min\!\left(1, \frac{\mathrm{count}(words)}{500}\right)$$
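
A direct implementation of this aggregation (a sketch; since the text describes a weighted average, we assume the weighted sum is normalized by the total weight):

```python
def university_dep_score(users):
    """users: iterable of (dep_score, word_count) pairs for one school.
    Weight each user's score by min(1, words/500), then average."""
    weighted_sum = total_weight = 0.0
    for score, word_count in users:
        w = min(1.0, word_count / 500)
        weighted_sum += score * w
        total_weight += w
    return weighted_sum / total_weight if total_weight else 0.0
```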

From here, we rank universities according to this aggregatedepression score in order to generate our results.

6. RESULTS

The primary result is a depression ranking of the top 25 academic universities as chosen by the U.S. News and World Report in 2014 [8]. The exact ranking is seen in Figure 4. The universities have, on average, 409 students in the data set; Rice University has the lowest number of students (88) and the University of Southern California the highest (827). We note that the California Institute of Technology (CalTech) is removed from the ranking because the data set construction process did not yield enough students (only 26) from CalTech for its score to be meaningful. The scores range from 2.130 for Duke, the least depressed school according to our rankings, to 2.268 for Penn, which tops our depression rankings.

7. ANALYSIS OF RESULTS

Interestingly, some trends and surprising results emerge from our rankings. At the top of the rankings are schools that appear to place a greater emphasis on pre-professional development. The University of Pennsylvania, University of California Los Angeles, Carnegie Mellon, Emory University, Johns Hopkins University, and the University of Virginia share a focus on undergraduate pre-professional programs: all but Emory University have undergraduate engineering programs, and all but UCLA have undergraduate business programs. By comparison, Duke and Stanford, two prestigious schools at the low end of the depression rankings, offer only a school of humanities alongside one of engineering, and over 80% of their students are enrolled in the respective schools of arts and sciences, per the schools' websites.

Furthermore, schools at the lower end of the rankings tend to have strong athletic programs and a sense of school spirit. For example, Duke University has a near-religious basketball following, Notre Dame is known for basketball and football, and Stanford for football, among other sports in which they excel.

Schools with a heavier emphasis on pre-professional development (e.g., University of Pennsylvania, University of California Los Angeles, Johns Hopkins University) tend to have a higher depression score in our ranking, whereas schools with strong athletic programs tend to rank much lower (e.g., Duke University, Stanford University, University of Notre Dame).³

Additionally, Cornell appears surprisingly low in our rankings (16th overall, 6th among Ivy League schools), as the media often portrays either Cornell or Yale (7th overall, second among Ivy League schools) as the most depressed Ivy League university. We believe that Cornell's lower-than-expected ranking relative to public perception may be a result of poor publicity relating to the campus. Public perception may be negative due to sensationalized reporting of Cornell suicides, which have occurred through jumping from bridges into the gorges. The Huffington Post supports our finding that Cornell does not have an above-average suicide rate compared to other universities [4].

In addition to the previous ranking, we use the same methodology to generate a set of depression rankings for the largest U.S. universities in terms of student enrollment, as per the Department of Education [16]. Additionally, we generate PERMA scores as well as PERMA rankings for the top 25 academic universities as a parallel ranking in order to validate our depression rankings, which we discuss below. Please refer to the appendix for these outputs.

³We note that our Tweets were gathered up until the beginning of March, prior to the start of the 2015 NCAA March Madness tournament. Thus, Duke's NCAA Men's Basketball Championship win, as a one-time event, did not deflate their depression scores, although their performance during the regular season may have played a factor.

Figure 4: Depression Ranking for Top 25 Academic Universities

8. EVALUATION OF RESULTS

There currently exists no established set of university depression rankings that is widely accepted in the research community. As a result, we are unable to provide a benchmark against which to evaluate the results of this work. Consequently, we rely primarily on human evaluation to assess the two main components of our system: the depression model and the data set mapping. Furthermore, we perform a correlation analysis between the depression rankings produced by our model and happiness rankings backed by existing work in psychology.

Depression Model

To evaluate the depression model, we create a web application that displays two Twitter users in our data set and asks testers to identify which user appears more depressed. For each pair of Twitter users, the web application ensures that one user scores high on our depression model (score > 2.7), thus exhibiting traits of depression according to the model, while the other user scores low (score < 2.3). The tester then examines the Tweets of each of the two users and evaluates which user appears more depressed based on the Tweets displayed. We compare the human results against the outputs of the depression model, where the depression model chooses the user with the higher depression score as more depressed.

In these results, when the depression model and the human agree on the more depressed user, we have a concordant pair; in the opposite case, a discordant pair. From these human-produced evaluations, we use the concordant and discordant pair counts to compute a Kendall's tau coefficient for our model using the equation:

$$\tau = \frac{n_c - n_d}{\tfrac{1}{2}\,n(n-1)}$$

where $n_c$ is the number of concordant pairs, $n_d$ is the number of discordant pairs, and $n$ is the number of items compared (so that $\tfrac{1}{2}n(n-1)$ is the total number of pairs in the test set). This statistic is commonly used to measure the association between two measured quantities, and it ranges over $-1 \le \tau \le 1$. Our model yields a $\tau$ coefficient of 0.651, demonstrating a strong positive correlation between human evaluation and our model outputs. Finally, the results can be used to calculate a p-value for our model; using the normal approximation to this statistic, we calculate a p-value of 0.088:

$$z = \frac{6\,(n_c - n_d)}{\sqrt{2n(n-1)(2n+5)}}$$
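
For concreteness, a from-scratch computation of both quantities from the concordant/discordant counts (scipy.stats.kendalltau offers an equivalent off-the-shelf route):

```python
from math import erf, sqrt

def kendall_tau_p(n_c, n_d, n):
    """Kendall's tau over n items (n(n-1)/2 pairs), with a two-sided
    p-value from the normal approximation z = 6(nc-nd)/sqrt(2n(n-1)(2n+5))."""
    tau = (n_c - n_d) / (n * (n - 1) / 2)
    z = 6 * (n_c - n_d) / sqrt(2 * n * (n - 1) * (2 * n + 5))
    p = 1 - erf(abs(z) / sqrt(2))   # two-sided tail of the standard normal
    return tau, z, p
```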

Validation with PERMA

The PERMA model is provided to us by WWBP, and we run it against our Twitter data set for the positive emotion element. After running the data, we calculate the correlations between depression, positive emotion (Pos P), negative emotion (Neg P), and a standardized metric for happiness, which we calculate as the Z-score for Pos P in the sample minus the Z-score for Neg P in the sample (Pos P − Neg P Z). Professor Martin Seligman, the father of positive psychology, previously identified the correlation between depression and happiness as −0.35 [12]. Our own correlation between depression rankings and standardized happiness is −0.38. Furthermore, our depression rankings show a very low negative correlation with positive emotion and a moderate correlation with negative emotion, a result supported by Seligman's previous work [12]; this lends further confidence to the methodology in this work.

Correlations    Pos P     Neg P     Pos P − Neg P Z
Dep Rank        −0.095    0.394     −0.376
Pos P Rank                −0.219    0.665
Neg P Rank                          −0.820

Table 1: PERMA Correlation with Depression Rankings

Validation with Other Metrics

Additionally, we have computed the correlation of university depression scores (as well as the PERMA scores) with some simple, easy-to-find metrics commonly used to rank universities in terms of academic prestige [8] [10]. The values (shown in Table 2) match common intuition. We see that retention rate, defined as the percentage of freshmen who enroll as sophomores at the same university, is negatively correlated with the model's depression score. This is expected, as a higher retention rate indicates that more students are returning to school after spending a year at the university. Interestingly, the acceptance rate of a university correlates positively with the depression score, which seems to indicate that students at exclusive universities are less depressed. This is further supported by the correlation between depression score and US News ranking, which can serve as a proxy for a university's prestige. In addition, we note that university enrollment is correlated with depression: the average depression score of the top 25 academic universities is lower than the score for the 40 largest schools, which supports the correlations in Table 2.

Correlations         Dep Score    Pos P Score    Neg P Score
Tuition and fees     −0.103       0.312          −0.157
Total enrollment     0.157        0.047          −0.103
2013 accept. rate    0.339        −0.398         0.326
Retention rate       −0.312       0.376          −0.346
US News ranking      0.262        0.481          −0.278

Table 2: Other Factors' Correlation with Depression Rankings

Data Set Mapping

In constructing our data set, we use the approach and implementation outlined in the previous sections to find Tweets for the 40 largest universities in the United States, by total enrollment as reported by the U.S. Department of Education [16], as well as for the top 25 academic universities as ranked by U.S. News and World Report [8]. We focus our evaluation on the data set for the top 25 academic universities, as this is the selection from which our primary ranking is generated.

The data set consists of, on average, 409 Twitter users per school and roughly 145,000 Tweets per school. More detailed statistics (at the school level) are shown in Table 3 below:

Statistic            min       average    max
# of Twitter users   88        409        827
# of Tweets          11,682    144,688    591,645

Table 3: Results of Data Set Construction

Recall

As previously mentioned, our data set construction approach aims for high precision. However, as expected, there is a trade-off between precision and recall. We can roughly estimate our recall with the following calculation. Using the U.S. Department of Education's statistics, we find that, on average, the top 25 academic schools have about 17,601 students each. Additionally, a study by Digiday [6] reports that as of November 2013, approximately 43.7% of college students were on Twitter. Using this, we estimate that the top 25 academic universities have, on average, approximately 7,692 Twitter users each. Because our data set has only 409 users per university, we obtain a recall of approximately 5.3%.
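
The arithmetic behind this estimate, using the figures stated above:

```python
avg_students = 17_601    # avg. enrollment, top 25 academic schools [16]
twitter_rate = 0.437     # share of college students on Twitter, Nov 2013 [6]
users_found = 409        # avg. users per school in our data set

est_twitter_users = avg_students * twitter_rate   # ~7,692 per school
recall = users_found / est_twitter_users          # ~0.053
print(f"{est_twitter_users:.0f} Twitter users, recall = {recall:.1%}")
```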

Although our data set has a very low recall and captures only a small fraction of Tweets for each university, it is of sufficient size for our model, which requires a minimum of 10,000 Tweets per school to produce meaningful results [12]. All 25 schools in our ranking have at least 10,000 Tweets, with the average being much higher (over 100,000 Tweets).

Precision

To evaluate the quality of the data set mapping phase, we construct a web application that provides an interface for testers, a selected group of colleagues at Penn, to review a sample of our data and verify the accuracy of the mapping between Twitter bios and universities. The webpage displays the Twitter biography of a randomly chosen user from our data set along with the university to which that user was mapped. Testers then use this information to determine whether the biography identifies the user as a current student at the listed university. Based on this validation, our university data mapping yields an accuracy of 86.9% on a sample of 390 Twitter user bios.

Drawbacks

The ideal data set for a university consists of either all Tweets posted by users at that university or a random subset of those Tweets. However, because of our method of finding Twitter users at each university, there is a systematic bias in which Tweets are captured by our data set: a bias towards Twitter users who list their school and graduation year in their Twitter bio and follow the school's Twitter account. One may argue that those who are more likely to list their school affiliation in their Twitter bio are less likely to be depressed, or one may argue the opposite. Regardless of such arguments, we make the underlying assumption that any such biases introduced into our data have an equal effect on the data for all universities in our data set, and therefore do not impact our results.

9. FUTURE WORK

There are several useful extensions of our work that may be explored further.

First, our novel message mapping approach is a useful way to label Twitter profiles with metadata about university affiliations. Using it, we were able to build a data set of Twitter users for each university. No such data set currently exists; thus, it may be useful to explore further applications of such a data set.

Additionally, our rankings looked at select groups of schools, such as academically prestigious undergraduate institutions and the largest schools in the United States. For a complete set of rankings, we would need to incorporate other universities into our data set.

Furthermore, a limiting constraint in our work is the number of users mapped to each university. For most universities, the mapping technique captures enough Twitter profiles to perform the analysis detailed in this work. However, in our sampling of the top academic universities, there is one outlier, CalTech, which is mapped to only a few dozen users. This is because CalTech has a very small student body, with an undergraduate enrollment of fewer than 1,000 students in 2012 [1]. To include such outliers in rankings and analysis, more sophisticated methods for university mapping, which improve recall without a significant trade-off in precision, should be developed.

The framework developed in our work may also be extended for depression analysis at institutions other than universities. For example, the system may be utilized as a human resources tool to evaluate worker morale based on language used in e-mails or on enterprise social media platforms such as Yammer. This would allow companies not only to increase employee satisfaction but also to improve in areas such as worker retention.

10. ETHICS

Although the depression model is validated with some degree of confidence, it cannot be used as a tool to diagnose individuals with depression. As discussed in prior sections of this work, the amount that an individual writes on social media will affect their depression score. Furthermore, the language that a single individual uses may not be enough to indicate their mental well-being. The depression model has not been verified medically and cannot supplant the opinion of professional services.

Another concern is that the data is collected from a publicly available source, Twitter, and is therefore not anonymized. It is possible to identify a user based on the data we have collected, such as their user id, biography, or Tweets, and the model connects a user with sensitive information about their mental well-being. As a result, our data set must be anonymized, and insights drawn from the data, especially about individuals, must be filtered to avoid defamation and ensure user confidentiality.

On a final note, while the framework developed in this work may be applied to studying depression beyond the university level, there are privacy and security concerns with regard to collecting user data. For example, if a depression model is applied to employee e-mails, there will likely be public concern about using this data to draw conclusions about the mental state of employees.

11. CONCLUSION

In this work, we create a set of university depression rankings using Twitter data. We develop a novel approach for mapping student Twitter accounts to universities in order to construct a data set of college student Tweets. Then, using differential language analysis and machine learning, we generate individual depression scores and aggregate them to the university level to create our depression rankings.

Our data set of college student Tweets and our depression model together form a useful tool for understanding and ranking depression among students at universities. We find, on average, 409 students for each of the top 25 academic universities, with a mapping accuracy of 86.9%. Furthermore, we develop a model with a p-value of 0.088 measured against human evaluation. While there is still room for improvement in our system, we have built a strong foundation for understanding depression across universities and for conducting other rankings and analyses at the university level. Student well-being is a serious and relevant topic on college campuses, and we hope that our model and insights about depression provide value and help for students, faculty, and administrators.

12. REFERENCES

[1] CalTech Undergraduate Admissions Facts and Stats, 2013.

[2] Twitter, 2014.

[3] Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. Predicting depression via social media. In Emre Kiciman, Nicole B. Ellison, Bernie Hogan, Paul Resnick, and Ian Soboroff, editors, ICWSM. The AAAI Press, 2013.

[4] Rob Fishman. Cornell suicides: Do Ithaca's gorges invite jumpers?, 2010.

[5] Christine Howes, Matthew Purver, and Rose McCabe. Linguistic indicators of severity and progress in online text-based therapy for depression. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 7–16, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.

[6] John McDermott. Facebook losing its edge among college-aged adults, 2014.

[7] Yair Neuman, Yohai Cohen, Dan Assaf, and Gabi Kedma. Proactive screening for depression through metaphorical and automatic text analysis. Artificial Intelligence in Medicine, 56(1):19–25, 2012.

[8] US News. US News and World Report's Annual College Rankings, 2014. Web. Accessed 19 Oct 2014.

[9] James W. Pennebaker, C. K. Chung, M. Ireland, A. Gonzales, and R. J. Booth. The development and psychometric properties of LIWC2007. Austin, TX: LIWC.net, 2007.

[10] Princeton Review. Surveying Students: How It Works | Princeton Review, 2014. Web. Accessed 28 Apr 2015.

[11] Stephanie Rude, Eva-Maria Gortner, and James Pennebaker. Language use of depressed and depression-vulnerable college students, 2004.

[12] H. Andrew Schwartz, Johannes Eichstaedt, Margaret L. Kern, Gregory Park, Maarten Sap, David Stillwell, Michal Kosinski, and Lyle Ungar. Towards assessing changes in degree of depression through Facebook. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 118–125, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.

[13] Martin E. P. Seligman. Flourish: A Visionary New Understanding of Happiness and Well-being. Atria Books, reprint edition, February 2012.

[14] Hugh Stephenson, Judith Pena-Shaff, and Priscilla Quirk. Predictors of college student suicidal ideation: Gender differences, 2006.

[15] Paul Thompson, Craig Bryan, and Chris Poulin. Predicting military and veteran suicide risk: Cultural aspects. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 1–6, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.

[16] National Center for Education Statistics, U.S. Department of Education. Selected statistics for degree-granting postsecondary institutions enrolling more than 15,000 students in 2012, by selected institution and student characteristics: Selected years, 1990 through 2011–12, May 2014.

APPENDIX

A. ADDITIONAL FIGURES

Figure 5: Depression Rankings for Top 25 Academic Schools

Figure 6: Depression Rankings for Top 40 Largest Schools

Figure 7: Depression and PERMA Rankings for Top 25 Academic Schools