Sharad Goel, Ashton Anderson, Jake Hofman, Duncan J. Watts (2015) The Structural Virality of Online Diffusion. Management Science Published online in Articles in Advance 22 Jul 2015

MANAGEMENT SCIENCE
Articles in Advance, pp. 1–17
ISSN 0025-1909 (print), ISSN 1526-5501 (online)

© 2015 INFORMS

The Structural Virality of Online Diffusion

Sharad Goel, Ashton Anderson
Stanford University, Stanford, California 94305 {[email protected], [email protected]}

Jake Hofman, Duncan J. Watts
Microsoft Research, New York, New York 10016 {[email protected], [email protected]}

Viral products and ideas are intuitively understood to grow through a person-to-person diffusion process analogous to the spread of an infectious disease; however, until recently it has been prohibitively difficult to directly observe purportedly viral events, and thus to rigorously quantify or characterize their structural properties. Here we propose a formal measure of what we label “structural virality” that interpolates between two conceptual extremes: content that gains its popularity through a single, large broadcast and that which grows through multiple generations with any one individual directly responsible for only a fraction of the total adoption. We use this notion of structural virality to analyze a unique data set of a billion diffusion events on Twitter, including the propagation of news stories, videos, images, and petitions. We find that across all domains and all sizes of events, online diffusion is characterized by surprising structural diversity; that is, popular events regularly grow via both broadcast and viral mechanisms, as well as essentially all conceivable combinations of the two. Nevertheless, we find that structural virality is typically low, and remains so independent of size, suggesting that popularity is largely driven by the size of the largest broadcast. Finally, we attempt to replicate these findings with a model of contagion characterized by a low infection rate spreading on a scale-free network. We find that although several of our empirical findings are consistent with such a model, it fails to replicate the observed diversity of structural virality, thereby suggesting new directions for future modeling efforts.

Keywords: Twitter; diffusion; viral media
History: Received August 14, 2013; accepted November 26, 2014, by Lorin Hitt, information systems.

Published online in Articles in Advance.

1. Introduction
When a piece of online media content—say, a video, an image, or a news article—is said to have “gone viral,” it is generally understood not only to have rapidly become popular, but also to have attained its popularity through some process of person-to-person contagion, analogous to the spread of a biological virus (Anderson and May 1991). In many theoretical models of adoption (Coleman et al. 1957, Bass 1969, Mahajan and Peterson 1985, Valente 1995, Bass 2004, Toole et al. 2012), in fact, this analogy is made explicit: an “infectious agent”—whether an idea, a product, or a behavior—is assumed to spread from “infectives” (those who have it) to “susceptibles” (those who do not) via some contact process, where susceptibles can then be infected with some probability.¹ Both intuitively and also in formal theoretical models, therefore, the notion of viral spreading implies a rapid, large-scale increase in adoption that is driven largely, if not exclusively, by peer-to-peer spreading. Clearly, however, viral spreading is not the only mechanism by which a piece of content can spread to reach a large population. In particular, mass media or marketing efforts rely on what might be termed a “broadcast” mechanism, meaning simply that a large number of individuals can receive the information directly from the same source. As with viral events, broadcasts can be extremely large—the Super Bowl attracts over 100 million viewers, while the front pages of the most popular news websites attract a similar number of daily visitors—and hence the mere observation that something is popular, or even that it became so rapidly, is not sufficient to establish that it spread in a manner that resembles social contagion.

¹ Even models of social contagion that do not correspond precisely to the mechanics of biological infectious disease (for example, “threshold models” (Granovetter 1978) make different assumptions regarding the nonindependence of sequential contacts with infectives (Lopez-Pintado and Watts 2008)) assume some form of person-to-person spread (Watts 2002, Kempe et al. 2003, Dodds and Watts 2004).

Figure 1 schematically illustrates these two stylized modes of distribution—broadcast and viral—where the former is dominated by a large burst of adoptions from a single parent node, and the latter comprises a multigenerational branching process in which any one node directly “infects” only a few others. Although the stylized patterns in Figure 1 are intuitively plausible and also easily distinguishable from one another, differentiating systematically









Goel et al.: The Structural Virality of Online Diffusion
Management Science, Articles in Advance, pp. 1–17, © 2015 INFORMS

Figure 1. A Schematic Depiction of Broadcast vs. Viral Diffusion, Where Nodes Represent Individual Adoptions and Edges Indicate Who Adopted from Whom

between broadcast and viral diffusion requires one, in effect, to characterize the fine-grained structure of viral diffusion events. Yet, in spite of a large theoretical and empirical literature on the diffusion of information and products, relatively little is known about their structural properties, in part because the requisite data have not been available until very recently, and in part because the concept of virality itself has not been formulated previously in an explicitly structural manner. Classical diffusion studies (Coleman et al. 1957, Rogers 1962, Bass 1969, Valente 1995, Young 2009, Iyengar et al. 2010), for example, typically had access to only aggregate diffusion data, such as the cumulative number of adoptions of a product, technology, or idea over time (Fichman 1992). In such cases, the observation of an S-shaped adoption curve—indicating a period of rapid growth followed by saturation—is typically interpreted as evidence of social contagion (Rogers 1962); however, S-shaped adoption curves may also arise from broadcast distribution mechanisms such as marketing or mass media (Van den Bulte and Lilien 2001). Compounding the difficulty, real diffusion events are unlikely to conform precisely to either of these conceptual extremes. In a highly heterogeneous media environment (Walther et al. 2010, Wu et al. 2011), where any given piece of content can spread via email, blogs, and social networking sites as well as via more traditional offline media channels, one would expect that popular content might have benefited from some possibly complicated combination of broadcasts and interpersonal spreading.

To understand the underlying structure of an event, therefore, one must reconstruct the full adoption cascade, which in turn requires observing both individual-level adoption decisions and also the social ties over which these adoptions spread. Only recently have data satisfying these requirements become available, as a result of online behavior such as blogging (Adar and Adamic 2005, Yang and Leskovec 2010), e-commerce (Leskovec et al. 2006), multiplayer gaming (Bakshy et al. 2009), and social networking (Sun et al. 2009, Yang and Counts 2010, Bakshy et al. 2011, Petrovic et al. 2011, Goel et al. 2012, Hoang and Lim 2012, Tsur and Rappoport 2012, Kupavskii et al. 2012, Jenders et al. 2013, Ma et al. 2013).

A second empirical challenge in measuring the structure of diffusion events, which has in fact been highlighted by these recent studies, is that the vast majority of cascades—over 99%—are tiny and terminate within a single generation (Goel et al. 2012). Large and potentially viral cascades are therefore necessarily very rare events; hence, one must observe a correspondingly large number of events to find just one popular example, and many times that number to observe many such events. As we will describe later, in fact, even moderately popular events occur in our data at a rate of only about one in a thousand, whereas “viral hits” appear at a rate closer to one in a million. Consequently, to obtain a representative sample of a few hundred viral hits—arguably just large enough to estimate statistical patterns reliably—one requires an initial sample on the order of a billion events, an extraordinary data requirement that is difficult to satisfy even with contemporary data sources.

In this paper, we make three distinct but related contributions to the understanding of the structure of online diffusion events. First, we introduce a rigorous definition of structural virality that quantifies the intuitive distinction between broadcast and viral diffusion and allows for interpolation between them. As we explain in more detail below, our definition is couched exclusively in terms of observed patterns of adoptions, not in terms of the details of the underlying generative process. Although this approach may seem counterintuitive in light of our opening motivation (which does make reference to generative models), the benefit is that the resulting measure does not depend on any modeling assumptions or unobserved properties, and hence can be applied easily in practice. Also importantly, by treating structural virality as a continuously varying quantity, we skirt any categorical distinctions between completely “broadcast” and “viral” events, allowing instead for open-ended and fine-grained distinctions between these two extremes; that is, events can be more or less structurally viral without imposing any particular threshold for becoming or “going” viral.

Our second contribution is to apply this measure of structural virality to investigate the diffusion of nearly a billion news stories, videos, pictures, and petitions on the microblogging service Twitter. To date, most studies directly documenting person-to-person diffusion have been limited to a small set of highly viral products (Liben-Nowell and Kleinberg 2008, Dow et al. 2013), leaving open the possibility that such hand-selected events are astronomically rare and not representative of viral diffusion more generally. In contrast, by systematically exploring the structural properties of a billion events on Twitter, we aim to estimate the frequency of structurally viral cascades, quantify the diversity in the structure of cascades, and investigate the relationship between cascade size and structure. It could be, for example, that the most popular content is also extremely viral, but equally it could be that successful products are mostly driven by mass media (i.e., a single large broadcast) or by some combination of broadcasts and word of mouth. Depending on the relative importance of broadcast versus viral diffusion in driving popularity, that is, the relationship between popularity and structural virality could be positive (larger events are dominated by viral spreading), negative (larger events are dominated by broadcasts), or neither (all events regardless of size exhibit a similar mix of broadcasts and virality, which scale together). Applying our structural virality measure to a representative sample of successful cascades, we find evidence for the third possibility, namely, that the correlation between popularity and virality is generally low. Moreover, for any given size (equivalent popularity), structural virality is extremely diverse: cascades can range between pure “broadcasts,” in the sense that all adopters receive the content from the same source, and highly “viral,” in the sense of comprising multigenerational branching structures.

The third contribution of this paper is to compare our empirical observations of cascade structure to predictions from a series of simple generative models of diffusion. Specifically, we conduct large-scale simulations of a simple disease-like contagion model, similar to the original Bass (1969) model of product adoption, on a network comprising 25 million nodes. In the simplest variant, we assume that the infectiousness of the “disease” is a constant, and the network on which it spreads is an Erdős–Rényi (ER) random graph. In successively more complicated variants, we allow the infectiousness to vary, or the network to be “scale free” (i.e., where the number of neighbors can vary from tens to tens of millions), or both. Because large diffusion events are so rare, we also conduct on the order of 1 billion simulations per parameter setting, necessitating over 100 billion simulations in total. We find that although our simplest models are incapable of replicating even the most general features of our empirical data, a still-simple model comprising constant infectiousness and scale-free degree distribution can capture many, but not all, of the observed features. We conclude with some suggestions for future modeling efforts.
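As a rough illustration of the simplest variant described above (constant infectiousness on an ER random graph), the sketch below simulates one cascade and records who infected whom. The function names, the parameter `beta`, the rule that each infected node gets one chance to infect each susceptible neighbor, and the tiny graph size are all assumptions made for illustration; this passage does not specify the authors' exact contagion rule, and their simulations run on 25 million nodes.

```python
import random
from collections import deque


def erdos_renyi(n, p, rng):
    """Undirected ER graph G(n, p) as an adjacency dict (naive O(n^2) construction)."""
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def simulate_cascade(adj, seed_node, beta, rng):
    """Spread from seed_node; returns {adopter: parent}, with the seed mapped to None.

    Assumed rule: each newly infected node tries once to infect each
    still-susceptible neighbor, succeeding with constant probability beta.
    """
    parent = {seed_node: None}
    queue = deque([seed_node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent and rng.random() < beta:
                parent[v] = u
                queue.append(v)
    return parent


rng = random.Random(0)
graph = erdos_renyi(200, 0.05, rng)
tree = simulate_cascade(graph, seed_node=0, beta=0.1, rng=rng)
cascade_size = len(tree)  # number of adopters, including the seed
```

The returned parent map is a diffusion tree in the same sense as the empirical cascades, so the same structural measures can be applied to simulated and observed events alike.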

2. Defining Structural Virality
We now turn to our first goal of defining structural virality. Before proceeding, we reemphasize that our notion of structural virality is intended to complement, not substitute for, the many existing generative models of viral propagation and their associated parameters (Bass 1969, Granovetter 1978, Watts 2002, Kempe et al. 2003, Dodds and Watts 2004). To clarify, generative models attempt to describe the underlying diffusion mechanism itself—for example, as a function of the intrinsic infectiousness of the object that is spreading, or of the properties of the contact process or the network over which the diffusion occurs, or of the timescales associated with adoption. By contrast, our notion of structural virality is concerned exclusively with characterizing the structure of the observable adoption patterns that arise from some unobserved generative process. Naturally, the particular value of structural virality associated with some event will in general depend on the underlying generative process—as indeed we will demonstrate in §5, where we introduce and study several such models. Importantly, however, our desired definition of structural virality should not depend on these particulars. In other words, regardless of what contagion process is (assumed to be) responsible for some piece of content spreading or what network it is spreading over, the end result is some pattern of adoptions that exhibits some structure, and our goal is to characterize a particular property of that structure.

Recalling also that our goal is to disambiguate between the broadcast and multigenerational branching schematics depicted in Figure 1, we first lay out some intuitively reasonable criteria that we would like any such metric to exhibit. First, for a fixed total number of adoptions in a cascade, structural virality should increase with the branching factor of the structure: specifically, it should be minimized for the broadcast structure on the left of Figure 1 and should be relatively large for structures with a high branching factor, as on the right of Figure 1. Second, for a fixed branching factor, structural virality should increase with the number of generations (i.e., depth) of the cascade; that is, all else equal, larger branching structures should be more structurally viral than smaller ones. Finally, and in contrast with multigenerational branching structures, larger broadcasts should not be any more structurally viral than smaller broadcasts; hence we require that, for the extreme case of a pure broadcast, structural virality be approximately independent of size.

A natural choice for such a metric is simply the number of generations, or depth, of the cascade. Indeed, after size, depth is one of the most widely reported summary statistics of diffusion cascades (Liben-Nowell and Kleinberg 2008, Goel et al. 2012, Dow et al. 2013). One problem with depth, however, is that a single, long chain can dramatically affect the measure. For example, a large broadcast with just one, long, multigenerational branch has large depth, even though we would not intuitively consider it to be structurally viral. To correct for this issue, one could instead consider the average depth of nodes (i.e., the average distance of nodes from the root). This average depth measure alleviates the problem of a handful of nonrepresentative nodes skewing the metric, and intuitively distinguishes between broadcasts and multigenerational chains. Even this measure, however, fails in certain cases. Notably, if an idea or product traverses a long path from the root and then is broadcast out to a large group of adopters, the corresponding cascade would have high average depth (since most adopters are far from the root) even though most adoptions in this case are the result of a single influential node.
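The two failure cases described above can be checked numerically. The sketch below builds both hypothetical trees (a large broadcast with one long branch, and a long chain that ends in a broadcast) and reports maximum depth and average depth for each. The tree sizes, the function names, and the child-to-parent dictionary representation are illustrative assumptions; average depth is taken here over all nodes, including the root.

```python
def depths(parent):
    """Distance of every node from the root; parent maps child -> parent (root -> None)."""
    depth = {}
    for node in parent:
        chain = []
        cur = node
        # Walk upward until reaching a node of known depth or passing the root.
        while cur is not None and cur not in depth:
            chain.append(cur)
            cur = parent[cur]
        base = depth[cur] if cur is not None else -1
        for offset, v in enumerate(reversed(chain), start=1):
            depth[v] = base + offset
    return depth


def depth_stats(parent):
    """Return (maximum depth, average depth over all nodes)."""
    d = depths(parent)
    return max(d.values()), sum(d.values()) / len(d)


def broadcast_with_branch(n_leaves, branch_len):
    """Root 0 broadcasts to n_leaves nodes and also starts one chain of branch_len nodes."""
    parent = {0: None}
    for leaf in range(1, n_leaves + 1):
        parent[leaf] = 0
    prev = 0
    for node in range(n_leaves + 1, n_leaves + 1 + branch_len):
        parent[node] = prev
        prev = node
    return parent


def chain_then_broadcast(chain_len, n_leaves):
    """A chain of chain_len edges from root 0 whose last node broadcasts to n_leaves nodes."""
    parent = {0: None}
    for i in range(1, chain_len + 1):
        parent[i] = i - 1
    for leaf in range(chain_len + 1, chain_len + 1 + n_leaves):
        parent[leaf] = chain_len
    return parent


# Depth is inflated by a single long branch, although average depth stays near 1.
max_a, avg_a = depth_stats(broadcast_with_branch(1000, 20))
# Average depth is inflated when the broadcaster sits far from the root.
max_b, avg_b = depth_stats(chain_then_broadcast(20, 1000))
```

In the first tree the maximum depth is 20 while the average depth is close to 1; in the second the average depth exceeds 20 even though nearly all adoptions come from one node.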

Addressing the shortcomings of both depth and average depth, we focus our attention on a classical graph property studied originally in mathematical chemistry (Wiener 1947), where it is known as the “Wiener index.” Specifically, we define structural virality ν(T) as the average distance between all pairs of nodes in a diffusion tree T; that is, for n > 1 nodes,

$$\nu(T) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}, \tag{1}$$

where d_{ij} denotes the length of the shortest path between nodes i and j.² Equivalently, ν(T) is the average depth of nodes, averaged over all nodes in turn acting as a root.

Our metric ν(T) provides a continuous measure of structural virality, with higher values indicating that adopters are, on average, farther apart in the cascade, and thus suggesting an intuitively viral diffusion event. In particular, as with depth and average depth, over the set of all trees on n nodes ν(T) is minimized on the star graph (i.e., the stylized broadcast model in Figure 1) where ν(T) ≈ 2. Moreover, a complete k-ary tree (as in Figure 1 with k = 2) has structural virality approximately proportional to its height; hence, structural virality will be maximized for structures that are large and that become that way through many small branching events over many generations.³
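To make the definition concrete, the following minimal sketch computes the average pairwise distance directly, running one breadth-first search from every node. It is the naive quadratic approach, not an optimized one, and the adjacency-dictionary representation and helper names (`star`, `path`) are assumptions for illustration. For a star on n nodes the value works out to 2 − 2/n, matching the ν(T) ≈ 2 stated above; for a single chain it is (n + 1)/3.

```python
from collections import deque


def structural_virality(adj):
    """Mean shortest-path distance over all ordered pairs of distinct nodes.

    adj: dict mapping each node to a list of its neighbors (undirected tree).
    One breadth-first search per node, so the cost is O(n^2) on a tree.
    """
    n = len(adj)
    if n < 2:
        raise ValueError("structural virality is defined only for n > 1 nodes")
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())  # the zero self-distance adds nothing
    return total / (n * (n - 1))


def star(n):
    """Pure broadcast: node 0 is connected to every other node."""
    adj = {i: [] for i in range(n)}
    for leaf in range(1, n):
        adj[0].append(leaf)
        adj[leaf].append(0)
    return adj


def path(n):
    """Single chain 0 - 1 - ... - (n - 1)."""
    adj = {i: [] for i in range(n)}
    for i in range(n - 1):
        adj[i].append(i + 1)
        adj[i + 1].append(i)
    return adj


broadcast = structural_virality(star(100))  # 2 - 2/n = 1.98
chain = structural_virality(path(100))      # (n + 1) / 3, about 33.67
```

Growing the star leaves the value pinned just below 2, whereas growing the chain increases it without bound, which is the size-independence property required of pure broadcasts.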

Although ν(T) satisfies some basic requirements of theoretical plausibility, as with the other candidate measures we discussed it is possible to construct hypothetical examples for which the corresponding numerical values are at odds with the motivating intuition. For example, a graph composed of two stars connected by a single, long path has large ν(T) but would not intuitively be considered viral. Whether or not such pathological cases appear with any meaningful frequency is, however, largely an empirical matter, and hence the utility of the metric must ultimately be evaluated in the context of real examples, which we discuss in detail below as well as in Appendix B.

² Naive computation of ν(T) requires O(n²) time; however, as discussed in Appendix B, a more sophisticated approach yields a linear-time algorithm (Mohar and Pisanski 1988), facilitating computation on very large cascades.

³ Somewhat more precisely, for any branching ratio k ≪ n, ν(T) increases with size n, whereas for k ≈ n (i.e., pure broadcasts) it does not; hence, increasing popularity corresponds to increasing structural virality only when it arises from “viral” spreading, not merely from larger broadcasts.
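One standard way to obtain the linear running time mentioned in footnote 2 uses the fact that, in a tree, every edge separates the nodes into two groups of sizes s and n − s and therefore lies on exactly s(n − s) shortest paths. Summing this product over edges gives the Wiener index, and doubling it gives the sum over ordered pairs in Equation (1). The sketch below follows that idea; it is not claimed to be the procedure of Appendix B or of Mohar and Pisanski (1988), and the child-to-parent input format and function name are assumptions.

```python
def structural_virality_linear(parent):
    """Average pairwise distance of a tree in O(n) time.

    parent: dict mapping each node to its parent, with the root mapped to None.
    """
    n = len(parent)
    if n < 2:
        raise ValueError("structural virality is defined only for n > 1 nodes")
    roots = [v for v, p in parent.items() if p is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")

    children = {v: [] for v in parent}
    for v, p in parent.items():
        if p is not None:
            children[p].append(v)

    # Breadth-first order: every node appears after its parent.
    order = [roots[0]]
    i = 0
    while i < len(order):
        order.extend(children[order[i]])
        i += 1

    # Process children before parents to accumulate subtree sizes.
    size = {v: 1 for v in parent}
    wiener = 0
    for v in reversed(order):
        p = parent[v]
        if p is not None:
            wiener += size[v] * (n - size[v])  # paths crossing edge (v, p)
            size[p] += size[v]
    return 2 * wiener / (n * (n - 1))


star_parents = {0: None, **{leaf: 0 for leaf in range(1, 100)}}
path_parents = {0: None, **{i: i - 1 for i in range(1, 100)}}
nu_star = structural_virality_linear(star_parents)  # 1.98
nu_path = structural_virality_linear(path_parents)  # (n + 1) / 3, about 33.67
```

Because each node and edge is touched a constant number of times, this form scales to cascades with many thousands of adopters where the all-pairs search becomes expensive.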

3. Data and Methods
Our primary analysis is based on approximately 1 billion diffusion events on Twitter, where an event constitutes the independent introduction of a piece of content into the social network—including videos, images, news stories, and petitions—along with all subsequent repostings of the same item.⁴ Specifically, we include in our data all tweets posted on Twitter that contained URLs pointing to one of several popular websites over a 12 month period, from July 2011 to June 2012.⁵ In total, we observe roughly 622 million unique pieces of content; however, because individual pieces of content can be posted by multiple users, we observe approximately 1.2 billion “adoptions” (i.e., posting of content). Although our data are not a total sample of Web content that is shared on Twitter,⁶ they do include the vast majority and hence are essentially unbiased at least with respect to Tweets linking to Web content.⁷ Importantly for our conclusions, our sample also exhibits considerable diversity both with respect to production and consumption. For example, a typical online video is likely to have been produced and distributed by an amateur videographer uploading his or her own work onto YouTube, whereas an article appearing in a major news outlet was likely written by a professional reporter. Moreover, the experience of watching a video is quite distinct from that of reading a news article, both in terms of the time and effort required on the part of the consumer and also their goals—for example, to be entertained versus informed—in doing so. Due in part to these qualitative differences on both the supply and also demand sides of the market for media, we find large quantitative differences in the frequency of the four domains; specifically, images and videos are far more numerous than news stories, and petitions are by far the least numerous. For similar reasons, therefore, one might also expect qualitatively distinct sharing mechanisms to dominate in different domains, leading to different patterns both of popularity and also structural virality.

⁴ We use the term “reposting” rather than the more conventional “retweet” because individuals frequently repost content that they receive from another user without using the explicit retweet functionality provided by Twitter, or even acknowledging the source of the content.

⁵ For news those websites include […]. For video they include […]. For images they include […]. For petitions they include […]. URLs and redirects were dereferenced from original tweets, and extraneous query parameters were removed from URLs to identify multiple versions of identical content. To avoid left censoring of our data (i.e., missing the initial postings of a URL), we look for occurrences of the URLs during the month prior to our analysis period and only include in our sample instances where the first observation does not appear before July 1, 2011. To avoid right censoring, we restrict to tweets introduced prior to June 30, 2012, but continue tracing the diffusion of these tweets through July 31, 2012.

⁷ It is of course possible that Tweets containing links to Web content are systematically different from other Tweets in ways that might affect our conclusions. For this reason, in Appendix D we conduct a separate analysis of tweets containing long hashtags, which are unlikely to diffuse outside of Twitter, finding qualitatively similar results.

To evaluate the structure of online diffusion, for each independent introduction of a unique piece of content in our data we construct a corresponding diffusion “tree” that traces each adoption back to a single “root” node, namely, the user who introduced that particular piece of content.⁸ Specifically, for each observation of a URL whose diffusion we seek to trace, we record (1) the adopter (i.e., the identity of the user who posted the content); (2) the adoption time (i.e., the time at which the content was posted); and (3) the identities of all users the adopter follows—hereafter referred to as the adopter’s “friends”—from whom the adopter could conceivably have learned about the content. For each such event, we first determine whether at least one of the adopter’s friends adopted the same piece of content previously. If no such friend exists, then the adopter is labeled a “root” of the resulting diffusion tree; otherwise, the friend who adopted the content most recently before the focal adopter—and who is most likely to have exposed the focal user to the content—is labeled the focal adopter’s “parent.” Although there is at times genuine ambiguity in determining the proximate cause of an adoption, in many cases adopters explicitly credit another individual in their tweet, allowing us to accurately infer an adopter’s parent in approximately 95% of instances (see Appendix C for details of the tree construction algorithm and the associated evaluation procedure).
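The parent-assignment rule described above can be sketched in a few lines. This is a simplified illustration only: the full procedure is in Appendix C, and the sketch ignores the explicit credit that adopters sometimes give in their tweets. The function name, the input formats, and the choice to keep only each user's first adoption are assumptions.

```python
def infer_parents(adoptions, friends):
    """Assign each adopter a parent using the most-recent-friend rule.

    adoptions: iterable of (user, time) pairs for one piece of content.
    friends:   dict mapping a user to the set of users that user follows.
    Returns {user: parent}, where parent is None for roots.
    """
    adopted_at = {}
    parent = {}
    for user, t in sorted(adoptions, key=lambda a: a[1]):
        if user in adopted_at:
            continue  # assumption: only a user's first adoption counts
        # Friends who adopted strictly earlier than this user.
        earlier = [
            (adopted_at[f], f)
            for f in friends.get(user, ())
            if f in adopted_at and adopted_at[f] < t
        ]
        if earlier:
            # The friend who adopted most recently before the focal adopter.
            parent[user] = max(earlier, key=lambda pair: pair[0])[1]
        else:
            parent[user] = None  # no prior adopting friend: a root
        adopted_at[user] = t
    return parent


# Hypothetical data: four users posting the same URL at times 0..3.
adoptions = [("a", 0), ("b", 1), ("c", 2), ("d", 3)]
friends = {"b": {"a"}, "c": {"a", "b"}, "d": {"x"}}
forest = infer_parents(adoptions, friends)
```

Here `c` follows both earlier adopters and is attached to `b`, the more recent of the two, while `d` follows nobody who posted the URL and so starts a separate tree.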

4. Results
Consistent with previous work (Bakshy et al. 2011, Goel et al. 2012), we find that the average size of these diffusion trees (also referred to interchangeably as “cascades” or “diffusion events”) is 1.3—meaning that for every 10 introductions of content, there are on average three additional downstream adoptions. More strikingly, and as noted in Goel et al. (2012), we also find that the vast majority of cascades terminate within a single generation; specifically, about 99% of adoptions are accounted for either by the root nodes themselves or by the immediate followers of root nodes. As noted previously (Goel et al. 2012), however, the preponderance of small and shallow events does not rule out the possibility that large, structurally interesting events do occur, only that they occur sufficiently infrequently so as not to be observed even in relatively large data sets. Exploiting the fact that we have a much larger data set than in previous studies—over a billion observations in our initial sample—we therefore now focus exclusively on the subsample of rare events that qualify as large, and hence have the potential to be structurally interesting. Specifically, hereafter we restrict attention to the 0.025% of diffusion trees containing at least 100 nodes (Figure 2), a requirement that leaves us with roughly 1 out of every 4,000 cascades, and thus reduces the number of cascades we study in detail from approximately 1 billion to 219,855.

⁸ Although diffusion trees are in reality dynamic objects, meaning that they grow over time as new adoptions take place, here we treat them as static objects representing the final state of a given diffusion process.

Figure 2. Distribution of Cascade Sizes on a Log–Log Scale, Aggregated Across the Four Domains We Study: Videos, News, Pictures, and Petitions
Horizontal axis: Cascade size (1 to 10,000).
Note. CCDF = complementary cumulative distribution function.
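The figures quoted in this paragraph can be cross-checked with simple arithmetic. The snippet below uses only the rounded values given in the text, so the implied total is approximate; the variable names are illustrative.

```python
# Figures quoted in the text (rounded as reported there).
mean_cascade_size = 1.3       # average adoptions per independent introduction
large_fraction = 0.025 / 100  # share of trees with at least 100 nodes
large_cascades = 219_855      # cascades retained for detailed study

# Downstream adoptions per 10 introductions: 10 * (1.3 - 1).
extra_per_ten = 10 * (mean_cascade_size - 1)

# 0.025% expressed as "one out of every N".
one_in = 1 / large_fraction

# Total number of cascades implied by the retained count and the fraction.
implied_total = large_cascades / large_fraction

print(round(extra_per_ten, 6), round(one_in), round(implied_total))
```

The implied total is about 880 million cascades, which agrees with the stated "approximately 1 billion" to within the rounding of the 0.025% figure.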

4.1. Structural Diversity
From this subpopulation of “successful” diffusion events, Figure 3 displays a stratified random sample ordered by structural virality ν(T). Specifically, cascades with between 100 and 1,000 adopters were ranked by ν(T) and logarithmically binned, and a random cascade was then drawn from each bin.⁹ We observe that the ordering from left to right and top to bottom by increasing ν(T) is strikingly consistent with how these same structures would be ranked intuitively in order of increasing virality, not only in the trivial case of disambiguating broadcast and viral extremes, but also in making relatively fine-grained distinctions between intermediate cases. Thus, ν(T) not only seems to be a reasonable measure of structural virality in theory, but also performs well in practice. Considering now the cumulative adoption curves shown below each cascade in Figure 3, we make two further observations. First, although the shape of these adoption curves varies considerably, from events that experience a phase of rapid growth before leveling off to events that grow almost linearly over time, there is no consistent relationship with structural virality. Strikingly, in fact, the least structurally viral of all our sampled events (top left) exhibits a cumulative adoption curve that is almost indistinguishable in shape from the most structurally viral (bottom right). Second, the timescales on which the adoptions take place (noted in hours on the horizontal axis of the cumulative plots) also vary widely, from less than an hour (bottom left) to three days (top left). As with the shape of the curves, however, there is no consistent relationship between the timescale (speed) of an adoption process and its associated structural virality. We conclude that our measure of structural virality not only effectively quantifies differences in the underlying cascade structures, but is clearly doing so by using features of the diffusion process that are not captured by aggregated data.

⁹ We note that this exercise was performed only once to avoid hand selection of the best “random” sample.

Figure 3. A Random Sample of Cascades Stratified and Ordered by Increasing Structural Virality, Ranging from 2 to 50
Notes. For ease of visualization, cascades were restricted to having between 100 and 1,000 adopters. Cumulative adoption curves (i.e., total cascade size over time) are shown below each cascade, with time indicated in hours. For visual clarity, the adoption curves terminate at 99% of the final cascade size.
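The sampling step used for Figure 3 (rank cascades by ν(T), bin logarithmically, draw one cascade at random per bin) might look roughly like the sketch below. The number of bins, the equal-width bins in log ν(T), the fixed seed, and the function name are assumptions; the text does not give these details.

```python
import math
import random


def stratified_log_sample(cascades, n_bins, seed=0):
    """Draw one random cascade from each non-empty logarithmic bin of virality.

    cascades: list of (cascade_id, virality) pairs with virality > 0.
    Returns the sampled pairs ordered by increasing bin, hence by virality.
    """
    rng = random.Random(seed)
    log_lo = math.log(min(v for _, v in cascades))
    log_hi = math.log(max(v for _, v in cascades))
    width = (log_hi - log_lo) / n_bins

    bins = [[] for _ in range(n_bins)]
    for cid, v in cascades:
        if width == 0:
            idx = 0  # all values identical: a single bin suffices
        else:
            # Clamp so that the maximum value falls in the last bin.
            idx = min(int((math.log(v) - log_lo) / width), n_bins - 1)
        bins[idx].append((cid, v))

    return [rng.choice(b) for b in bins if b]


# Made-up cascades with virality spread between 2 and 50.
toy = [(i, 2 + 0.5 * i) for i in range(97)]
sample = stratified_log_sample(toy, n_bins=6)
```

Logarithmic bins give the rare, highly viral cascades their own strata, so they appear in the sample even though a uniform draw would almost never select them.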

The ordering also highlights our first main empirical finding: Although the structures in Figure 3 are all of similar size (i.e., have similar aggregate numbers of adopters), they exhibit remarkable diversity in structure, from an approximately pure broadcast (ν(T) ≈ 2, top left) to an ideal-type branching structure (ν(T) = 34, bottom right), with numerous intermediate variations in between. The classical literature on diffusion often posits a critical threshold—or “tipping point”—for virality, suggesting a sharp break between cascades that are viral and those that are not. If the tipping point intuition is correct, one would expect that relatively large diffusion events such as those captured in the n = 100 (roughly one event in 4,000) to n = 1,000 (one in 100,000) range would be dominated either by broadcasts on the one hand or by viral spreading on the other hand, but that combinations of the two should not arise. More generally, one might expect only a handful of canonical forms to account for the majority of large events: for example, some events spread exclusively via broadcast, whereas others spread exclusively via word of mouth, and others still spread by some combination of the two. In other words, whatever one’s intuitive mental model of diffusion, one would likely expect to find that successful diffusion events of a given size would be typified by some combination of broadcast and viral diffusion, or at least some small taxonomy of types. It is striking, therefore, that Figure 3 shows examples of fine-grained variations in structural virality across the entire range of possibilities.

Figure 4. Size and Structural Virality Distributions on a Log–Log Scale for Cascades Containing at Least 100 Adopters, Separated by Domain
Horizontal axes: Cascade size (100 to 10,000) in panel A; Structural virality (3 to 30) in panel B.
Note. CCDF, complementary cumulative distribution function.

4.2. Examining Popularity and Structural Virality
Although Figure 3 shows that one can find examples of cascades across the spectrum of structural virality, it says little about their relative frequency or how that varies by domain. To address these questions, Figure 4(A) shows the size distribution of cascades larger than 100 adopters for all four domains—news, videos, images, and petitions—while Figure 4(B) shows the corresponding distributions of structural virality. As anticipated, Figure 4(A) shows that cascades can grow very large: For images and videos, the largest cascades attract several tens of thousands of reposts, whereas the most popular news stories are somewhat smaller (roughly 10,000 reposts), and petitions smaller still (several thousand reposts). In other words, although the vast majority of cascades are indeed small, large cascades do occur, albeit with low frequencies. Moreover, the size distributions appear to cluster into two categories: one comprising images and videos and the other comprising the rather less popular categories of petitions and news stories. In other words, the most popular videos and images are more popular than the most popular news stories and petitions not only because there are many more of the former, but also because the corresponding distributions exhibit a shallower slope; that is, for any given percentile of the relevant population, videos and images are more popular than petitions and news stories. Although we lack a compelling explanation for this systematic difference, we note that the vast majority of the most popular Twitter accounts belong not to news organizations or petition sites, but to celebrities, whose postings often contain images and videos. Moreover, YouTube and Instagram are among the top 10 most followed accounts, further facilitating the visibility of videos and images, respectively. It thus seems likely that one of the primary drivers of large image and video cascades is their promotion by individuals with large numbers of followers, consistent with past results (Bakshy et al. 2011).

Next, Figure 4(B) confirms the impression from Figure 3 that structural virality varies widely, from 2 (pure broadcast) to over 30. In particular, in contrast to classical “tipping point” theories of diffusion, we do not see a bimodal distribution of structural virality corresponding to broadcasts on the one hand and viral spreading on the other, but rather a continuous distribution of structural virality, confirming our earlier speculation that in some sense every conceivable combination of broadcasts and word-of-mouth transmission is represented. Interestingly, however, popular petitions are substantially more structurally viral than any other type of content, followed by videos, images, and news stories. For example, whereas about a quarter of popular petitions have structural virality of at least 10 (meaning that petitions having garnered at least 100 adopters are quite likely to have grown virally), only about 3% of videos, 1% of images, and 0.5% of news stories exhibit the same level of structural virality. In spite of the diversity evident both in Figure 3 and Figure 4(B), therefore, the relatively larger size of cascades involving videos and images combined with their relatively low structural virality suggests that the largest cascades in those categories are not especially viral in a structural sense. In the next section, we examine this possibility in more detail.

Figure 5 Box Plot of Structural Virality by Size on a Log–Log Scale, Separated by Domain

[Panels: Petitions, News, Pictures, Videos. x-axis: cascade size; y-axis: structural virality.]

Note. Lines inside the boxes indicate median structural virality, whereas the boxes themselves show interquartile ranges.
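The distributions discussed above are reported as CCDFs. As a minimal sketch of how such a curve could be computed from a list of cascade sizes or structural virality values, the following uses the convention P[X ≥ x]; that convention, the function name `empirical_ccdf`, and the sample data are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def empirical_ccdf(values):
    """Return sorted (x, P[X >= x]) pairs for the distinct values in `values`."""
    n = len(values)
    counts = Counter(values)
    ccdf = []
    remaining = n  # number of observations greater than or equal to the current x
    for x in sorted(counts):
        ccdf.append((x, remaining / n))
        remaining -= counts[x]
    return ccdf


# Hypothetical cascade sizes, for illustration only.
sizes = [100, 100, 150, 300, 300, 300, 1200, 5000]
for x, p in empirical_ccdf(sizes):
    print(x, p)
```

Plotting these pairs on log-log axes gives a curve of the kind described for Figure 4; a shallower slope then corresponds to a heavier tail.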

4.3. Relationship Between Popularity and Structural Virality

As pointed out earlier, the relationship between popularity (cascade size) and structural virality is not a priori obvious; that is, depending on the empirically observed preponderance of broadcasts in small versus large events, the relationship could be positive (large events are less likely to be dominated by broadcasts than small events), negative (large events are more likely to be dominated by broadcasts than small events), or neither. Put another way, if cascades typically grow via person-to-person diffusion, we would expect structural virality to increase with cascade size. On the other hand, if large cascades are the product of broadcasts attributable to popular users on Twitter (the most popular of whom have tens of millions of followers), structural virality may not vary significantly with size, or could even decrease.

We investigate this question by examining the distribution of structural virality conditional on cascade size for each domain. First, and consistent with Figure 4, Figure 5 shows that across all sizes for which they occur, popular petitions are considerably more viral than content in the other domains. Second, Figure 5 shows that across all domains and size ranges, structural virality varies considerably, confirming again the visual impression of Figure 3. Third, however, Figure 5 shows that for three out of four domains (petitions, images, and videos) median structural virality remains surprisingly invariant with respect to size. For images and videos, moreover, it is also surprisingly low: even the very largest cascades, comprising 10,000 reposts or more, exhibit median structural virality of less than 3, barely more than the theoretical minimum of 2. For petitions, meanwhile, median structural virality is between 7 and 8, roughly equivalent to a branching tree of depth between three and four generations: not a pure broadcast but still relatively shallow. Finally, for news, the relationship between size and structural virality is more positive than for the other domains, but also still surprisingly low. For cascades of size 100, for example, median structural virality is approximately 3, whereas for the largest observed news cascades, comprising 3,000 reposts, median structural virality is still less than 8, comparable to petitions.
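The conditional view used here (structural virality summarized within bins of cascade size) can be sketched as follows. The log-spaced binning with two bins per decade, the function name `median_virality_by_size_bin`, and the sample data are assumptions for illustration; the paper does not state how its bins were chosen.

```python
import math
from collections import defaultdict
from statistics import median


def median_virality_by_size_bin(cascades, bins_per_decade=2):
    """Group (size, virality) pairs into log-spaced size bins.

    Returns {rounded lower edge of the bin: median virality in that bin}.
    """
    groups = defaultdict(list)
    for size, virality in cascades:
        # Bin index on a log10 scale, e.g. sizes 100..316 share index 4.
        idx = math.floor(math.log10(size) * bins_per_decade)
        groups[idx].append(virality)
    return {
        round(10 ** (idx / bins_per_decade)): median(vals)
        for idx, vals in sorted(groups.items())
    }


# Hypothetical (size, structural virality) pairs, for illustration only.
cascades = [(100, 2.0), (150, 3.0), (200, 4.0), (1000, 5.0), (2000, 7.0)]
print(median_virality_by_size_bin(cascades))
```

Comparing the per-bin medians across bins is one simple way to see whether typical virality changes with size, independently of how much it varies within a bin.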

We emphasize that there is nothing inevitable about this result. It could have been, for example, that the very largest events are characterized by multigenerational branching structures; indeed, that is the clear implication of the phrase “going viral.” So it is surprising that even the very largest events are, on average, dominated by broadcasts. It is also surprising that the correlation between size and structural virality is so low. As shown in Figure 6, the correlation for news is 0.2, indicating a positive but noisy relationship, whereas for petitions it is even lower (0.04), indicating almost no relationship at all, and for pictures and videos it is essentially zero. In contrast with our earlier result on diversity, which suggests that simply knowing the size of a cascade reveals very little about its structure, the combination of generally low values of structural virality and low correlation with size suggests that if popularity is consistently related to any one feature, it is the size of the largest broadcast.10

As in our discussion of Figure 4, we can only speculate about why (a) petitions are so much more structurally viral for every size category than other domains and (b) news stories show higher correlation between size and structural virality. We suspect, however, that the main driving factor is once again a relative dearth of large broadcast channels for petitions in particular and to a lesser extent news organizations. The popularity of images and videos, by contrast, is likely driven by celebrities, who increasingly have tens of millions of followers on Twitter, and whose posting behavior likely favors content of a personal and often visual nature over news and calls to action.

10 We also note that these results are not affected by the fact that the range of $\nu(T)$ varies with cascade size; the results are qualitatively identical when we use a measure of structural virality with a constant bounded range (see Appendix B).

Figure 6 Correlation Between Cascade Size (Popularity) and Structural Virality Across Four Domains

[Categories shown: News, Petitions, Videos, Pictures.]

5. Theoretical Modeling

To recap, we have three main empirical findings. First, and consistent with previous work (Goel et al. 2012), the vast majority of diffusion events are small and accordingly lack much structure. Second, rare events that do become large exhibit striking structural diversity. And third, the size of these cascades is at most weakly correlated to their structural virality. Together these findings present an interesting theoretical question, namely, can they be replicated by a single underlying generative mechanism? And if so, what features are required? Although replicating some empirical results with a theoretical model does not on its own imply that the model is an accurate representation of the true generative process (Ijiri et al. 1977), it is nevertheless possible to rule some models out.

To address this question, we consider a series of variations on the SIR model, a classical model of biological contagion (Kermack and McKendrick 1927, Anderson and May 1991) that has frequently been adapted to model social diffusion processes,11 initially to the specific context of new product adoption, where it is known as the Bass (1969) model, and subsequently to a wide range of other contexts including the propagation of links over a network of blogs (Leskovec et al. 2007). In any such model, there are two key sets of parameters. First, when an individual is infected (in the present case, with a piece of content), he or she subsequently infects each of his or her susceptible (i.e., not yet infected) contacts independently with probability $\beta$. Often $\beta$ is assumed to be a constant, but in the current context, where it refers to the “infectiousness” of content, it is natural to think of it as being drawn from some distribution (which itself may be described by additional parameters). And second, we must specify the nature of the contact process, which here we model as a network in which $k$ is the average node degree (i.e., the number of opportunities a typical node has to infect others) and $\sigma^2$ is the degree variance.12

11 Reflecting its origins in mathematical epidemiology, the model is named for the three states (“susceptible,” “infectious,” and “recovered”) that each node in the network can occupy. Numerous variations of the basic SIR model have also been proposed, including the SI model, the SEIR model (where the “E” indicates “exposed”), the SIRS model, and so on (Anderson and May 1991). Here we refer to all such models canonically as SIR models.

Before proceeding, it is helpful to introduce the quantity $r = k\beta$ (known in mathematical epidemiology as the “basic reproduction number” or $R_0$ of a disease). As alluded to earlier, a standard result for diseases spreading on random networks is that the condition $r = c$, where $c = 1/(1 + (\sigma/k)^2) \le 1$, constitutes a critical threshold or tipping point, separating two regimes: a “supercritical,” or “viral,” regime $r > c$, in which small seeds can trigger exponential growth leading to large epidemics, and a “subcritical” regime $r < c$, in which the contagion almost surely dies out after infecting only a small number of susceptibles. From this general result, moreover, two more specific results follow. First, in Erdős–Rényi (ER) random networks $G(n, p)$, where the expected degree is $k \sim np$ and $\sigma^2 \sim k$ (as $n \to \infty$), the epidemic threshold condition reduces to $r \sim 1$ for $k \gg 1$. And second, in scale-free random networks (Barabási and Albert 1999) for which the variance diverges with the size of the network, it reduces to $r \sim 0$ as $n \to \infty$ (Pastor-Satorras and Vespignani 2001, Lyons 2000, Lloyd and May 2001), meaning that in sufficiently large scale-free networks, the subcritical regime effectively disappears.
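As a quick check on how the general threshold specializes to the two cases, the short derivation below rewrites $c$ in terms of $k$ and $\sigma^2$ and substitutes the two variance regimes. The numerical value $k = 10$ is a hypothetical choice made only to give a feel for the magnitudes; it does not come from the paper.

```latex
\[
c \;=\; \frac{1}{1 + (\sigma/k)^2} \;=\; \frac{k^2}{k^2 + \sigma^2}.
\]
% Erdos-Renyi: \sigma^2 \sim k
\[
\sigma^2 \sim k
\;\Longrightarrow\;
c \sim \frac{k^2}{k^2 + k} = \frac{k}{k+1} \;\to\; 1
\quad (k \gg 1),
\qquad \text{e.g. } k = 10:\; c \approx \tfrac{10}{11} \approx 0.91 .
\]
% Scale-free: variance diverges with network size while k stays finite
\[
\sigma^2 \to \infty \ \text{with } k \text{ fixed}
\;\Longrightarrow\;
c = \frac{k^2}{k^2 + \sigma^2} \;\to\; 0 .
\]
```

In the second case any fixed $r > 0$ eventually exceeds $c$, which is the sense in which the subcritical regime disappears.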

These results are relevant to our analysis for two reasons. First, because viral events for which $r > 1$ exhibit exponential growth regardless of network structure and because we know from our data that large events are extremely rare, we restrict our analysis to the region $0 < r < 1$, corresponding to what in everyday usage would be thought of as “subcritical” spreading. Second, because we will consider both ER and scale-free random networks, the usual super- versus subcritical distinction is somewhat misleading. Specifically, whereas it does have a clear meaning for ER networks, for which only contagions with $r > 1$ are viral in the everyday sense of growing exponentially, in scale-free networks, all contagions are viral in the technical sense of exceeding the epidemic threshold, even though they are “dying out” as they attempt to spread.13 As we will show next, in fact, models invoking ER networks are easily dismissed as incompatible with our empirical results, suggesting that the popular tipping point notion is largely irrelevant to the kind of viral events we study here.

12 Additional parameters are also natural. For example, we only consider strict SIR models in the sense that after one time step, infected nodes are “removed” from the dynamics, meaning that they can no longer infect others nor become reinfected. Although natural for our case, where having “adopted” a piece of content one cannot unadopt it, other assumptions are clearly possible, in which case additional parameters would be needed.

We consider four models of increasing complexity and verisimilitude. In all cases, each realization of the simulation commences with an entirely susceptible population comprising 25 million individuals within which a single individual is randomly chosen to be the initially infected “seed” and proceeds until no further infections can take place.14 We start by investigating contagions characterized by constant $\beta$ spreading on an ER random graph. In light of the enormous attention paid to variations of this model both in the mathematical epidemiology (Kermack and McKendrick 1927, Anderson and May 1991) and marketing (Bass 1969, Valente 1995, Bass 2004) literatures, it is the natural baseline to consider. As noted above, however, its relevance to our empirical data can quickly be dismissed by showing that, consistent with standard theoretical results (Anderson and May 1991), the cascade size distribution is tightly centered around its mean regardless of the average network degree or infection rate, which is qualitatively different than the heavy-tailed size distribution we observe in the data.
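A minimal sketch of this baseline is given below, assuming a far smaller population than the 25 million individuals used in the paper so that the graph fits in memory. The function names `er_graph` and `sir_cascade`, the parent-map representation of the diffusion tree, and the parameter values in the usage lines are all illustrative assumptions, not the authors' implementation.

```python
import random
from collections import deque


def er_graph(n, p, rng):
    """Adjacency lists of an undirected Erdos-Renyi graph G(n, p)."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj


def sir_cascade(adj, beta, seed, rng):
    """Simulate one cascade; return {node: parent}, with the seed mapped to None.

    Each infected node gets a single chance to infect each still-susceptible
    contact, independently with probability beta, and is then removed.
    """
    parent = {seed: None}
    frontier = deque([seed])
    while frontier:
        u = frontier.popleft()  # FIFO order processes nodes generation by generation
        for v in adj[u]:
            if v not in parent and rng.random() < beta:
                parent[v] = u
                frontier.append(v)
    return parent


# Hypothetical small-scale run: n = 1000 nodes, mean degree k = 10, r = k * beta = 0.5.
rng = random.Random(0)
n, k, r = 1000, 10, 0.5
adj = er_graph(n, k / (n - 1), rng)
sizes = [len(sir_cascade(adj, r / k, rng.randrange(n), rng)) for _ in range(500)]
print("largest cascade:", max(sizes), "mean size:", sum(sizes) / len(sizes))
```

The returned parent map is a rooted tree, so it can be fed directly into any routine that computes structural virality from parent-child relationships.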

One explanation for this result is that our assumption of constant $\beta$ is unlikely to be correct. Presumably, content introduced to Twitter exhibits large differences in intrinsic interestingness and breadth of appeal, and therefore likelihood of being shared. This observation motivates the next model we consider, where the infection is again modeled as spreading on an ER graph, but the infectiousness of each piece of content, $\beta_i$, is now drawn from a power law distribution $\Pr(\beta_i) \sim \beta_i^{-\alpha}$, expressing the more plausible assumption that a large number of items in our sample are of low “quality” or “appeal” and hence are unlikely to spread (low $\beta$), whereas a small minority of appealing or high-quality items are much more likely to spread (high $\beta$). Studying this case, we do indeed recover the heavy-tailed size distribution from our empirical results. Interestingly, however, across parameter settings we consistently observe high correlation between cascade size and structural virality (because large cascades in ER networks must necessarily be multigenerational), which again stands in stark contrast to our empirical results. We therefore conclude that it is the ER network, not necessarily the assumption about constant item quality, that is responsible for the poor model fit.

13 The intuitive explanation for this counterintuitive result is that in scale-free networks, a typical node is likely to be connected via at most a short path to a “hub” node with an extremely high degree that, if infected, can sustain an infection that would ordinarily die out (Pastor-Satorras and Vespignani 2001).

14 Clearly on Twitter a single unique piece of content can be introduced many times independently. In such cases, there is potential for two cascades to “collide,” which clearly cannot happen in our simulations, where we introduce only one seed at a time. In light of the extreme rarity of large cascades, however, and the large size of the Twitter network, such collisions are also rare; hence, we do not believe this simplification has any significant consequences. We also note that our model is a special case of what has been called “simple contagion” (Centola 2010), in which the infection probability is independent across multiple exposures. In contrast with “complex contagion,” such as occurs in “threshold models” (Granovetter 1978), where multiple exposures can combine in highly nonlinear ways, the use of individual seeds for simple contagion is relatively unproblematic.

Thus motivated, we now examine a third model in which we again assume $\beta$ to be a constant, but the network is now a scale-free random network (Barabási and Albert 1999), constructed using the configuration method15 (Newman 2005, Clauset et al. 2009), reflecting the roughly power law degree distribution $p(k) \sim k^{-\alpha}$ observed for Twitter (Bakshy et al. 2011). Sweeping over the two parameters, $\alpha$ and $\beta$, we simulated content of varying infectiousness diffusing over networks with varying degree skew. Figure 7 shows the results of nearly 100 billion simulations, with 1 billion cascades generated for each parameter setting $(\alpha, \beta)$, roughly congruent with the number of cascades we analyzed on Twitter. Figure 7 shows that for certain parameters ($r \approx 0.5$ and $\alpha \approx 2.3$) the model recapitulates several important features of our empirical data.16 First, Figure 7(A) shows that for this parameter setting the probability of a given piece of content becoming “popular” (meaning that it attracts at least 100 adoptions) is consistent with the observed rate of roughly one in one thousand. Second, Figure 7(B) shows that the mean structural virality for these parameters is 5, which again is in line with our observations. Third, Figure 7(C) shows that the correlation between size and structural virality is also in the observed range. Finally, Figure 8 shows the full marginal distributions of size and virality, and the distribution of virality conditional on size for this parameter choice, where we again see that the simulated cascades are similar to the empirically observed events. One notable difference between empirical and simulation results, however, is that the variance in each bin (as measured by the interquartile range) in the rightmost plot in Figure 8 is considerably less than that in Figure 5, indicating that empirical cascades exhibit much more structural diversity at any given size compared to those generated by the model.

15 For each node in the network, its number of followers (i.e., out-degree) was first randomly selected according to a discrete power law degree distribution with exponent $\alpha$, a minimum value of 10, and a maximum value of 1 million. Then nodes in the networks were randomly connected while preserving the specified degrees.

16 The power law exponent of $\alpha \approx 2.3$ is consistent with the observed degree distribution on Twitter (Kwak et al. 2010).

Figure 7 Likelihood of Becoming Popular (i.e., Having at Least 100 Adopters), Mean Structural Virality, and the Correlation Between Size and Structural Virality for Simulated Cascades Generated from an SIR Model on a Random Scale-Free Network, Plotted as a Function of the Model Parameters

Note. Each line corresponds to a different exponent $\alpha$ for the power-law network degree distribution, and $r = \beta k$ is the expected number of individuals a random node infects in a fully susceptible population.
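The network construction described in footnote 15 can be approximated as follows. This is a simplified stand-in rather than the configuration method itself: follower counts are drawn from a truncated discrete power law and preserved exactly, but each node's followers are chosen uniformly at random, so the degree distribution on the other side of each edge is not controlled. The function names, the reduced maximum degree, and the network size are assumptions made to keep the sketch small.

```python
import random


def sample_power_law_degrees(n, alpha, k_min, k_max, rng):
    """Draw n integer degrees with P(k) proportional to k**(-alpha) on [k_min, k_max]."""
    support = range(k_min, k_max + 1)
    weights = [k ** (-alpha) for k in support]
    return rng.choices(support, weights=weights, k=n)


def wire_followers(degrees, rng):
    """For each node u, pick degrees[u] distinct random followers, never u itself."""
    n = len(degrees)
    followers = []
    for u, d in enumerate(degrees):
        picks = rng.sample(range(n - 1), d)  # d distinct slots among the other n - 1 nodes
        followers.append([v if v < u else v + 1 for v in picks])  # shift past u
    return followers


# Hypothetical small-scale run; the paper uses k_max = 1 million and 25 million nodes.
rng = random.Random(0)
degrees = sample_power_law_degrees(n=5000, alpha=2.3, k_min=10, k_max=1000, rng=rng)
followers = wire_followers(degrees, rng)
print("mean followers:", sum(degrees) / len(degrees), "max followers:", max(degrees))
```

Given the mean degree $k$ of the sampled network, a target value of $r$ then fixes the per-contact infection probability as $\beta = r / k$.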

These simulation results can be interpreted in two ways. On the one hand, it is striking that so simple a model, with only two tunable parameters, can capture many of the basic empirical regularities of what is undoubtedly a far more complex and multifaceted system. For example, although the success of real-world products is almost certainly affected by their quality, this connection is absent from our model. Indeed, for any fixed parameter choice under the SIR model, all cascades (the largest broadcasts, the most viral cascades, and the many events that acquire only a handful of adopters) have the same infectiousness $\beta$. In other words, taking infectiousness as a proxy for quality, in our simulations the largest and most viral cascades are not inherently better than those that fail to gain traction, but are simply more fortunate (Watts 2002). On the other hand, it is also interesting that our model is not able to fully capture the diversity of structural virality exhibited in the empirical data. Although we can only speculate on the reasons for this limitation, two possible explanations immediately suggest themselves. The simplest explanation is that as large as our simulated networks are (25 million nodes), they are still not as large nor is the network structure as complex as the actual Twitter follower graph, which comprises roughly 500 million users, the most connected of whom have well over 50 million followers. Possibly, therefore, the difference could be accounted for simply by increasing the size of the networks by another one or two orders of magnitude, an increase that is computationally challenging but straightforward in theory. A second, and perhaps more likely, explanation is that our assumption of constant $\beta$ remains too simplistic, and that introducing variation in $\beta$ into our model would also increase the variation of structural virality at any given size.

The fourth and final model that we simulate therefore replaces constant $\beta$ with $\beta_i$ drawn from a power law distribution, identical to the ER case in our second model above. Surprisingly, however, a similarly extensive set of simulations using this model finds that it does not in fact lead to noticeably more structural diversity; moreover, it leads to high correlation between size and structural virality. The reason for both results is that higher (lower) values of $\beta_i$ generate larger (smaller) events, not more (less) structurally viral events of the same size. Thus, even though the diversity of $\beta_i$ does affect the size distribution of cascades, for a given cascade size it does not generate more diversity of structural virality. Identifying a mechanism that accounts for the observed diversity of structural virality therefore presents an interesting challenge for future modeling work.

Figure 8 Box Plot of Structural Virality by Size (on a Log–Log Scale) for 1 Billion Simulated Cascades Generated from an SIR Model on a Random Scale-Free Network with $\alpha = 2.3$ and $r = 0.5$

[Panels, left to right: CCDF of cascade size; CCDF of structural virality; structural virality by cascade size.]

Note. CCDF, complementary cumulative distribution function.

6. Discussion

Returning to our opening motivation, our paper makes three main contributions. First, we have introduced the concept of structural virality, one of the first measures to formally quantify the structure of information cascades. Although our results are restricted to the diffusion of information on Twitter, our structural approach to diffusion processes applies quite generally, both to online and offline settings. It is often claimed, for example, that some of the most successful Internet products in recent history, such as Hotmail, Gmail, and Facebook, were driven primarily by word-of-mouth adoption, in part because the companies that created these products did not initially have large advertising budgets, and in part because by design they contained features to explicitly encourage sharing. Yet these products also benefitted from extensive media coverage, which might have driven large numbers of adoptions from a small number of broadcast events. Likewise, although popular Internet memes are typically described as having spread virally, they also typically receive substantial media coverage. Without reconstructing the actual sequence of events by which a given product, idea, or piece of content was adopted, and relatedly without a metric for quantifying virality, the mere observation of popularity (however rapidly accrued) allows one to conclude little about the relative importance of viral versus broadcast mechanisms in determining the observed outcome. With the appropriate data, therefore, our notion of structural virality could conceivably shed light on a much broader range of diffusion processes than we have considered here.

Our second contribution is to measure the fine-grained structure of nearly 1 billion naturally occurring diffusion events in a specific online setting, namely, Web content spreading on Twitter. In particular, we have identified hundreds of thousands of large cascades (the biggest such collection to date), revealing remarkable structural diversity of diffusion events, ranging from broadcast to viral and containing essentially everything in between, where we emphasize that such an exercise would be difficult absent a metric for classifying and ordering the structure of these cascades automatically. In addition, we find relatively low correlation between size and virality, highlighting the difficulty of determining how content spread given only knowledge of its popularity.

Third, we have shown that a simple model of contagion is broadly consistent with our empirical findings. The theoretical literature has largely focused on supercritical diffusion processes to model large, viral cascades; however, the vast majority of diffusion events comprise only a few nodes, and rarely extend more than one generation beyond the root node, or seed (Goel et al. 2012). Events of this latter kind are naturally attributable to subcritical diffusion,17 and hence one might be tempted to model online diffusion via two categorically distinct mechanisms, separately accounting for the head and tail of the distribution. Indeed, the very label “viral hit” implies precisely the exponential spreading of the sort observed in contagion models in their supercritical regime. It is therefore notable that essentially everything we observe, including the very largest and rarest events, can be accounted for by a simple model operating entirely in the low infectiousness parameter regime. Indeed our best model fit is for $r \approx 0.5$, which is considerably lower even than a previous “subcritical” estimate of approximately 0.99 based on the diffusion of chain letters (Golub and Jackson 2010), a difference that is likely due to the heavy-tailed (scale-free) degree distribution of Twitter.18

17 For example, Leskovec et al. (2007) found that a susceptible-infected-susceptible (SIS) model with $\beta = 0.025$, equivalent to $r \approx 0.14$, was able to replicate the size distribution of observed cascades of links over a network of blogs.

Finally, in addition to our three scientific contributions, we note that our work also contributes to the emerging field of computational social science in the sense that it addresses a traditional social science question (How does content spread via social networks?) but answers it using a type and scale of data that has only recently become available; that is, only after tracing the propagation of over a billion pieces of content can we collect an unbiased sample of large, and exceedingly rare, cascades to observe their subtle structural properties. By contrast, previous work (Goel et al. 2012) that investigated the propagation of nearly one million news stories and videos, one of the largest diffusion studies at the time, was only able to observe relatively small events, resulting in a qualitatively incomplete view of diffusion. In a similar vein, the most relevant previous analysis of the structure of extremely large diffusion events relied on just two examples, specifically the reconstructed paths of two Internet chain letters (Liben-Nowell and Kleinberg 2008). Although collecting even two such examples required considerable ingenuity, it is nevertheless the case that inferring general principles from so few observations is inherently difficult (Golub and Jackson 2010, Chierichetti et al. 2011). One of our main findings, in fact, is that large diffusion events exhibit extreme diversity of structural forms, a finding that necessarily requires many examples. Thus, although our current work is by no means exhaustive, its scale facilitates a significant step toward describing the nature and diversity of online information diffusion.

18 We note that this finding also recalls earlier work that sought to account for the surprisingly long-term and low-level persistence of computer viruses in terms of a low-infectiousness contagion spreading over a scale-free network (Pastor-Satorras and Vespignani 2001). Although that work did not address the structural properties of the events in question, the mechanism identified as responsible (namely, low-infectiousness contagion combined with the occasional encounter with a high-degree node) is largely similar to the one investigated here.

Appendix A. Computing Structural Virality

The average distance measure of structural virality that we use, $\nu(T)$, has often been applied in mathematical chemistry, where it is known as the Wiener index, and its efficient computation has also long been known. For completeness, here we present a simple and scalable method to compute $\nu(T)$. We begin by showing how the Wiener index, as well as the average depth of a tree, can be expressed in terms of the sizes of various subtrees.

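Lemma 1. For a tree $T$ with $n$ nodes, let $\mathrm{depth}_{\mathrm{avg}}$ denote the average depth of nodes in the tree. Letting $\mathcal{S}$ be the set of all subtrees of $T$, we have

$$\frac{1}{n} \sum_{S \in \mathcal{S}} |S| = \mathrm{depth}_{\mathrm{avg}} + 1.$$

Proof. For any node $v_i \in T$ and any subtree $S \in \mathcal{S}$, let $\delta_S(v_i)$ be 1 if $v_i \in S$ and 0 otherwise. Then,

$$\sum_{S \in \mathcal{S}} |S| = \sum_{S \in \mathcal{S}} \sum_{i=1}^{n} \delta_S(v_i) = \sum_{i=1}^{n} \sum_{S \in \mathcal{S}} \delta_S(v_i) = \sum_{i=1}^{n} \bigl[1 + \mathrm{depth}(v_i)\bigr],$$

where the last equality holds because a node is contained in exactly the subtrees rooted at itself and at each of its ancestors. The result now follows by dividing each side by $n$. □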

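Theorem 2. For a tree $T$ with $n$ nodes, let $\mathrm{depth}_{\mathrm{avg}}$ denote the average depth of nodes in the tree, let $\mathrm{dist}_{\mathrm{avg}}$ denote the average distance between all pairs of distinct nodes (i.e., $\mathrm{dist}_{\mathrm{avg}} = \nu(T)$), and let $\mathcal{S}$ be the set of all subtrees of $T$. Then,

$$\mathrm{dist}_{\mathrm{avg}} = \frac{2n}{n-1} \left[ 1 + \mathrm{depth}_{\mathrm{avg}} - \frac{1}{n^2} \sum_{S \in \mathcal{S}} |S|^2 \right]. \tag{A1}$$

In particular,

$$\mathrm{dist}_{\mathrm{avg}} = \frac{2n}{n-1} \left[ \frac{1}{n} \sum_{S \in \mathcal{S}} |S| - \frac{1}{n^2} \sum_{S \in \mathcal{S}} |S|^2 \right]. \tag{A2}$$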

Proof. Statement (A2) in the theorem follows directly from (A1) together with Lemma 1, and so we only need to establish statement (A1). For any two nodes $v_i, v_j \in T$, let $\mathrm{LCA}(v_i, v_j)$ denote their lowest common ancestor: the unique node in $T$ of greatest depth that has both $v_i$ and $v_j$ as descendants (where a node is allowed to be a descendant of itself). Since the shortest path between $v_i$ and $v_j$ goes through $\mathrm{LCA}(v_i, v_j)$, we have

$$\begin{aligned}
\mathrm{dist}(v_i, v_j) &= \mathrm{dist}(v_i, \mathrm{LCA}(v_i, v_j)) + \mathrm{dist}(\mathrm{LCA}(v_i, v_j), v_j) \\
&= [\mathrm{depth}(v_i) - \mathrm{depth}(\mathrm{LCA}(v_i, v_j))] + [\mathrm{depth}(v_j) - \mathrm{depth}(\mathrm{LCA}(v_i, v_j))] \\
&= \mathrm{depth}(v_i) + \mathrm{depth}(v_j) - 2 \cdot \mathrm{depth}(\mathrm{LCA}(v_i, v_j)).
\end{aligned}$$

Let $\mathrm{subtrees}(v_i, v_j)$ be the set of subtrees that contain both $v_i$ and $v_j$, and observe that this set consists of exactly those subtrees that contain $\mathrm{LCA}(v_i, v_j)$. Since for any node $v$ there are $1 + \mathrm{depth}(v)$ subtrees that contain it,

$$|\mathrm{subtrees}(v_i, v_j)| = 1 + \mathrm{depth}(\mathrm{LCA}(v_i, v_j)).$$

Substituting this expression into the previous equation, we see that

$$\mathrm{dist}(v_i, v_j) = 2 + \mathrm{depth}(v_i) + \mathrm{depth}(v_j) - 2\,|\mathrm{subtrees}(v_i, v_j)|.$$

For any node $v_i \in T$ and any subtree $S \in \mathcal{S}$, let $\delta_S(v_i)$ be 1 if $v_i \in S$ and 0 otherwise. Then, summing over all $n^2$ pairs of nodes, we have

$$\begin{aligned}
\sum_{i,j=1}^{n} \mathrm{dist}(v_i, v_j) &= 2n^2 + 2n \sum_{i=1}^{n} \mathrm{depth}(v_i) - 2 \sum_{i,j=1}^{n} \sum_{S \in \mathcal{S}} \delta_S(v_i)\,\delta_S(v_j) \\
&= 2n^2 + 2n \sum_{i=1}^{n} \mathrm{depth}(v_i) - 2 \sum_{S \in \mathcal{S}} |S|^2.
\end{aligned}$$

The result follows by dividing through by $n(n-1)$, the number of ordered pairs of distinct nodes. □
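As a sanity check on the subtree-size formula, consider a pure broadcast: a star with one root and $n - 1$ leaves. The worked example below is our own illustration rather than part of the original appendix; it evaluates $\mathrm{dist}_{\mathrm{avg}} = \frac{2n}{n-1}\bigl[\frac{1}{n}\sum_S |S| - \frac{1}{n^2}\sum_S |S|^2\bigr]$ and compares the result with a direct count of pairwise distances.

```latex
% Star on n nodes: one subtree of size n (rooted at the root), n-1 subtrees of size 1.
\[
\sum_{S} |S| = n + (n-1) = 2n - 1,
\qquad
\sum_{S} |S|^2 = n^2 + (n-1).
\]
\[
\mathrm{dist}_{\mathrm{avg}}
= \frac{2n}{n-1}\left[\frac{2n-1}{n} - \frac{n^2 + n - 1}{n^2}\right]
= \frac{2n}{n-1}\cdot\frac{(n-1)^2}{n^2}
= \frac{2(n-1)}{n}.
\]
% Direct count: n-1 root-leaf pairs at distance 1, \binom{n-1}{2} leaf-leaf pairs at distance 2.
\[
\frac{(n-1)\cdot 1 + \binom{n-1}{2}\cdot 2}{\binom{n}{2}}
= \frac{(n-1)(n-1)}{n(n-1)/2}
= \frac{2(n-1)}{n}.
\]
```

Both routes agree, and the value approaches 2 as $n$ grows, which matches the figure of roughly 2 quoted in the main text for a pure broadcast.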

Theorem 2 shows that $\nu(T)$ can be expressed in terms of the sizes of subtrees of $T$. Algorithm 1 uses this observation to efficiently compute $\nu(T)$.

Algorithm 1 (Computing $\nu(T)$)
Require: T is a tree rooted at node r
function Subtree-Moments(T, r)
    if T.size() = 1 then    ▷ The base case
        size ← 1
        sum-sizes ← 1
        sum-sizes-sqr ← 1
    else    ▷ Recurse over the children of the root r
        for c ∈ r.children() do
            size_c, sum-sizes_c, sum-sizes-sqr_c ← Subtree-Moments(T, c)
        size ← 0
        sum-sizes ← 0
        sum-sizes-sqr ← 0
        for c ∈ r.children() do
            size ← size + size_c
            sum-sizes ← sum-sizes + sum-sizes_c
            sum-sizes-sqr ← sum-sizes-sqr + sum-sizes-sqr_c
        size ← size + 1
        sum-sizes ← sum-sizes + size
        sum-sizes-sqr ← sum-sizes-sqr + size²
    return size, sum-sizes, sum-sizes-sqr
function Average-Distance(T, r)
    size, sum-sizes, sum-sizes-sqr ← Subtree-Moments(T, r)
    dist_avg ← [2 · size/(size − 1)] × [sum-sizes/size − sum-sizes-sqr/size²]
    return dist_avg
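A runnable sketch of the same computation is given below. It is an adaptation rather than a transcription: the tree is assumed to be given as a dictionary mapping each node to a list of its children, the traversal is iterative instead of recursive (to avoid recursion limits on deep cascades), and a brute-force checker is included purely for verification. All function names and the input format are our own assumptions.

```python
from collections import deque


def subtree_moments(children, root):
    """Return (n, sum of subtree sizes, sum of squared subtree sizes)."""
    order, stack = [], [root]
    while stack:  # preorder traversal: every parent appears before its descendants
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    size = {}
    sum_sizes = sum_sizes_sqr = 0
    for u in reversed(order):  # reversed preorder: children are handled before parents
        size[u] = 1 + sum(size[c] for c in children.get(u, []))
        sum_sizes += size[u]
        sum_sizes_sqr += size[u] ** 2
    return size[root], sum_sizes, sum_sizes_sqr


def average_distance(children, root):
    """Mean distance over pairs of distinct nodes (requires at least 2 nodes)."""
    n, s1, s2 = subtree_moments(children, root)
    return (2 * n / (n - 1)) * (s1 / n - s2 / n ** 2)


def brute_force_average_distance(children):
    """Quadratic-time check: breadth-first search from every node of the tree."""
    adj = {}
    for u, kids in children.items():
        for c in kids:
            adj.setdefault(u, []).append(c)
            adj.setdefault(c, []).append(u)
    total = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())  # sums over ordered pairs starting at s
    n = len(adj)
    return total / (n * (n - 1))


# Two hypothetical five-node trees: a pure broadcast and a pure chain.
star = {0: [1, 2, 3, 4]}
chain = {0: [1], 1: [2], 2: [3], 3: [4]}
for name, tree in [("star", star), ("chain", chain)]:
    print(name, average_distance(tree, 0), brute_force_average_distance(tree))
```

The linear-time routine touches each node a constant number of times, whereas the checker is quadratic and is only practical for small trees.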

Figure B.1 Box Plot of an Alternative Measure of Structural Virality (Average Cascade Depth) by Size (on a Log Scale), Separated by Domain

[Figure: four panels (Petitions, News, Pictures, Videos); horizontal axis: cascade size.]

Note. Lines inside the boxes indicate the medians, whereas the boxes themselves show interquartile ranges.

Table B.1 Rank Correlation Between Alternative Measures of Structural Virality

                      Average     Relative     Distinct    Average
                      distance    broadcast    parent      depth
Average distance       1          −0.79         0.73        0.90
Relative broadcast    −0.79        1           −0.98       −0.66
Distinct parent        0.73       −0.98         1           0.61
Average depth          0.90       −0.66         0.61        1

Appendix B. Alternative Measures of Structural Virality
Although we have demonstrated that our particular definition of structural virality is reasonable, there are several other formalizations of the concept that also qualify as reasonable candidates. In particular, here we consider the following three metrics:

1. the relative size of the largest broadcast (i.e., the largest number of children of any single node in the diffusion tree, as a fraction of the total number of nodes in the tree);

2. the probability that two randomly selected nodes have distinct parent nodes;

3. the average depth of nodes in the tree.

Simple inspection shows that all three of these alternatives distinguish between the extremes of a single, large broadcast on the one hand and a multigenerational “viral” cascade on the other. However, they all capture subtly different structural aspects of diffusion trees, and also fail for somewhat different pathological cases. Consequently, as with our primary definition above, it is difficult to evaluate the utility of the various metrics on theoretical grounds alone, or even to assess their similarity. In practice, however, we find that they are all highly correlated with our chosen average path length measure, and also with each other. Specifically, Table B.1 shows that when computed over the entire set of empirically observed cascades with at least 100 adopters, $\nu(T)$ has an absolute rank correlation greater than 0.73 with all three alternative measures. Moreover, our empirical results are qualitatively similar regardless of which of these alternative measures of structural virality we apply. For example, Figure B.1 shows the relationship between size and average depth, analogous to Figure 5, and from which essentially the same conclusions could be drawn.









Thus, although we cannot rule out the possibility that a superior metric to ours can be defined, we can at least substantiate two related claims: first, that our choice of metric is at least roughly as good as a number of other plausible candidates, and second, that our substantive findings are robust with respect to the particular manner in which we formalize the concept of structural virality.
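The three alternative metrics listed in this appendix can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the tree is given as a hypothetical `parent` mapping (node to parent, with the root mapped to `None`), depth is counted from 0 at the root, and the "distinct parent" probability is taken over pairs of distinct non-root nodes, since the text does not specify how the root (which has no parent) is treated.

```python
from math import comb


def alternative_measures(parent):
    """Return (relative broadcast, distinct-parent probability, average depth)."""
    n = len(parent)

    # Number of children per node.
    child_count = {}
    for node, p in parent.items():
        if p is not None:
            child_count[p] = child_count.get(p, 0) + 1

    # 1. Largest number of children of any node, as a fraction of tree size.
    relative_broadcast = max(child_count.values(), default=0) / n

    # 2. Probability that two distinct non-root nodes have different parents
    #    (assumption: the root is excluded from the sampling).
    m = n - 1
    same_parent_pairs = sum(comb(k, 2) for k in child_count.values())
    distinct_parent = 1 - same_parent_pairs / comb(m, 2) if m >= 2 else float("nan")

    # 3. Average depth, computed by walking up to the nearest node of known depth.
    depth = {}
    for node in parent:
        chain, v = [], node
        while v is not None and v not in depth:
            chain.append(v)
            v = parent[v]
        d = depth[v] if v is not None else -1
        for u in reversed(chain):
            d += 1
            depth[u] = d
    average_depth = sum(depth.values()) / n

    return relative_broadcast, distinct_parent, average_depth


star = {0: None, 1: 0, 2: 0, 3: 0, 4: 0}   # pure broadcast
path = {0: None, 1: 0, 2: 1, 3: 2}         # pure chain
print(alternative_measures(star))
print(alternative_measures(path))
```

On these two extremes the measures move in the expected directions: the star has a large relative broadcast and low average depth, and the chain has the opposite.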

Appendix C. Tree Construction Method
Here we describe the process of constructing a diffusion tree for a particular piece of content (e.g., a given URL). Trees are composed of one node for each user who has adopted the content, and each edge links a user back to an inferred “parent.” After each adoption has been identified as either a root or the child of another post, we construct the cascade of adoptions.

In an ideal setting we would have access to this information for each adoption, but in practice these details are not always available. The best-case scenario is use of Twitter’s official retweet functionality, which enables a user to effectively forward a tweet that was originally authored by someone else. Attribution is clear in these cases, and tree construction would be relatively straightforward if all adoptions were of this form. Unfortunately, however, users also repost content using a variety of unofficial conventions, which complicate the attribution task. For instance, the unofficial retweet convention amounts to copying the text of a tweet and prepending “RT @username” to credit another individual. Twitter treats these posts as originally authored content and has no formal way of linking them back to original posts. Finally, users may forgo crediting a source entirely, in which case one must make an educated guess about who (if anyone) in their feed exposed them to the content and who should be credited as responsible for their adoption.

We decompose the process of inferring a parent into two steps, described in detail below. We estimate that our inference procedure correctly identifies the parent of an adoption in approximately 95% of instances.

1. Identify potential parents. For each user who adopts a piece of content, we identify a set of “potential parents,” defined as individuals whose adoption of a piece of content appears in the focal user’s timeline prior to the focal user’s adoption. In other words, potential parents are the set of individuals who are likely to have exposed the user to the adopted content. To identify these potential parents, we note that a user’s timeline contains (1) all posts originally authored by the user’s friends and (2) tweets authored by others that at least one of the user’s friends has “officially retweeted” using Twitter’s built-in reposting functionality. In particular, any tweet appears at most once in a user’s timeline regardless of how many of his or her friends have officially retweeted it.¹⁹ To compute the set of potential parents for a given adoption, we join activity from the Twitter Firehose application programming interface (API), which provides details about each tweet, with the Twitter follower graph, which provides the listing of who follows whom.

19 Any nonofficial reposting (e.g., using the “RT @username” convention) is considered originally authored, resulting in potentially repeated content in a user’s timeline.

2. Infer a single parent. We now identify the single most likely parent from the set of all potential parents of a given adoption. To do this, we consider three cases based on how the focal user posted the content.

a. Official retweet. If the focal user officially retweeted a post that appeared in their timeline (i.e., retweeted the post via Twitter’s built-in functionality), then the Twitter API provides the ID of the original tweet. We then use this information to identify the individual who introduced the post to the user’s timeline as the parent. We note that the parent need not be the original author of the tweet (for example, in the case of a friend who retweeted a third party, as described above). Also, users occasionally officially retweet content that did not appear in their timelines (e.g., because they discovered it by browsing); in these cases we treat the focal user as a “root” and do not assign a parent. Overall, in these official retweet cases, which constitute 65% of the instances we consider, we almost certainly correctly attribute the tweet.

b. Accredited repost. In the case of a nonofficial retweet, credit may still be present in the form of a mentioned user, for example, using the “RT @username” convention. We identify as the parent the individual who most recently introduced a post of that content, authored by the mentioned user, to the focal user’s timeline. This mentioned user may be a friend of the focal user, in which case the friend is assigned as the parent. Alternatively, the mentioned user may be a third party (e.g., a friend of a friend). In this case, the friend who most recently mentioned the accredited user alongside the piece of content is identified as the parent. As above, if no such friend can be identified, we treat the focal user as a root and do not assign a parent. Accredited posts constitute 10% of the adoptions we analyze, and as in the case of official retweets, the inferred parent is almost certainly correct.

c. Uncredited repost. In this final case, we lack any explicit information about how the user was exposed to the content and simply assign as the parent the friend who most recently introduced the content to the focal user’s timeline. If no such friend exists, we again treat the focal user as a root. To assess the accuracy of our inference strategy in this case, we apply it to the set of official retweets, for which we are fairly certain which individual is the parent of any given adoption. We find that the most-recent-introduction heuristic correctly identifies the parent 79% of the time.
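The two-step procedure above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the `Exposure` and `Adoption` records and all of their field names are hypothetical, and we assume the join of tweet activity with the follower graph has already produced, for each focal user, the list of posts of the content that appeared in their timeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exposure:
    """One appearance of the content in the focal user's timeline (hypothetical record)."""
    friend: str     # friend who introduced the post to the timeline
    credited: str   # user credited by that post (the friend themself if self-authored)
    tweet_id: str   # ID of the underlying tweet
    time: float     # when the post appeared in the timeline


@dataclass
class Adoption:
    """The focal user's own post of the content (hypothetical record)."""
    user: str
    time: float
    retweeted_id: Optional[str] = None     # set for official retweets
    mentioned_user: Optional[str] = None   # set for "RT @username" style credit


def infer_parent(adoption: Adoption, exposures: List[Exposure]) -> Optional[str]:
    """Return the inferred parent, or None if the focal user is treated as a root."""
    # Step 1: potential parents are exposures that precede the adoption.
    prior = [e for e in exposures if e.time < adoption.time]

    # Step 2: narrow the candidates according to how the content was posted.
    if adoption.retweeted_id is not None:        # (a) official retweet
        candidates = [e for e in prior if e.tweet_id == adoption.retweeted_id]
    elif adoption.mentioned_user is not None:    # (b) accredited repost
        who = adoption.mentioned_user
        candidates = [e for e in prior if e.friend == who or e.credited == who]
    else:                                        # (c) uncredited repost
        candidates = prior

    if not candidates:
        return None
    # Most-recent-introduction rule among the remaining candidates.
    return max(candidates, key=lambda e: e.time).friend


timeline = [
    Exposure(friend="alice", credited="alice", tweet_id="t1", time=1.0),
    Exposure(friend="bob", credited="carol", tweet_id="t2", time=2.0),
    Exposure(friend="dave", credited="dave", tweet_id="t3", time=5.0),
]
print(infer_parent(Adoption(user="erin", time=3.0, mentioned_user="carol"), timeline))
```

In the usage lines, "bob" is selected because he is the friend who most recently posted the content while crediting "carol" before the adoption time.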

Since our inference procedure almost certainly identifies the correct parent in the first two cases (official retweets and accredited reposts, which together account for 75% of adoptions) and since we estimate 79% accuracy for the remaining 25% of adoptions, we conclude that the overall accuracy of our parent inference strategy is 95%.
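The arithmetic behind this estimate can be written out explicitly. The sketch below assumes, as an idealization of "almost certainly correct," that the first two cases are attributed with accuracy 1.

```latex
% Overall accuracy as a mixture over the three posting cases
% (assumption: accuracy 1 for official retweets and accredited reposts).
\begin{aligned}
\text{accuracy} &\approx \underbrace{0.65 \times 1}_{\text{official retweets}}
  + \underbrace{0.10 \times 1}_{\text{accredited reposts}}
  + \underbrace{0.25 \times 0.79}_{\text{uncredited reposts}} \\
&= 0.75 + 0.1975 = 0.9475 \approx 95\%.
\end{aligned}
```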

Appendix D. Off-Channel Diffusion
Although our empirical findings are qualitatively quite similar across the four distinct domains studied above, it is possible that all four suffer from one of two systematic biases that might affect our conclusions. First, a potential problem with studying the diffusion of external content on Twitter (e.g., news stories from the New York Times and videos from YouTube) is that the same content may also spread via other channels, such as Facebook or email. As a result of this “off-channel” diffusion, two individuals on Twitter who appear









Figure D.1 Size and Structural Virality Distributions on a Log–Log Scale for Popular Hashtag Cascades, Containing at Least 100 Adopters

[Figure: two CCDF panels; horizontal axes: cascade size (ticks at 100, 1,000, 10,000) and structural virality (ticks at 3, 10, 30).]

Note. CCDF, complementary cumulative distribution function.

to have introduced the same piece of content independently may in fact be connected, thus leading us to mistakenly treat a single diffusion tree as two disjoint events. A second concern is that our use of reposting rather than retweeting also potentially biases our data. Specifically, user–follower similarity (i.e., homophily) may lead connected users to post the same content independently in close temporal sequence, leading us to conflate similarity with influence (Shalizi and Thomas 2011, Aral et al. 2009, Lyons 2011).

Figure D.2 Box Plot of Structural Virality by Size on a Log–Log Scale for Hashtag Cascades

[Figure: box plot; horizontal axis: cascade size; vertical axis: structural virality.]

Note. Lines inside the boxes indicate the median structural virality, whereas the boxes themselves show interquartile ranges.

To check that off-channel diffusion does not systematically bias our findings, we consider the diffusion of Twitter-specific “hashtags,” short fragments of text used to indicate the topic of a tweet. Because such hashtags are less likely to have originated outside of Twitter, and because for the same reason they are less likely to migrate off of Twitter, these data are correspondingly less susceptible to any biases associated with off-channel diffusion. Moreover, to ensure as much as possible that we are considering only on-Twitter uses of hashtags, we restrict our sample to “long” hashtags, which are especially unlikely to be used elsewhere. To define “long,” we note that hashtags on Twitter are generally written in camel case (e.g., #CamelCase). Treating each substring that begins with a capitalized letter and ends immediately before the next capitalized letter as a “word,” we trace the diffusion of hashtags that include five or more such words (e.g., #ThisIsALongHashtag). As infrequent as these long hashtags are relative to hashtags in general, they are still plentiful, amounting to 58,000 cascades with at least 100 adopters. Figures D.1 and D.2 show that the diffusion of these long hashtags yields qualitatively similar results to our primary analysis, suggesting that off-channel diffusion is not driving our findings.
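The “long hashtag” filter described above can be sketched with a regular expression. This is an illustrative reading of the definition, not the authors' code; the function names and the `min_words` parameter are hypothetical, and text preceding the first capital letter is simply ignored, since the definition counts only substrings that begin with a capital.

```python
import re

# A "word" starts at a capital letter and runs until just before the next capital.
_WORD = re.compile(r"[A-Z][^A-Z]*")


def camel_case_words(hashtag):
    """Split a camel-case hashtag into its capitalized words."""
    return _WORD.findall(hashtag.lstrip("#"))


def is_long_hashtag(hashtag, min_words=5):
    """True if the hashtag contains at least `min_words` camel-case words."""
    return len(camel_case_words(hashtag)) >= min_words


print(camel_case_words("#ThisIsALongHashtag"))
print(is_long_hashtag("#ThisIsALongHashtag"), is_long_hashtag("#CamelCase"))
```

Note that under this rule a single capital letter such as the "A" in #ThisIsALongHashtag counts as its own word, which is what makes the paper's example reach five words.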

