a corpus linguistic study of bollywood song lyrics in the...

10
A Corpus Linguistic Study of Bollywood Song Lyrics in the Framework of Complex Network Theory Aseem Behl IIIT-Hyderabad Hyderabad, India 500032 [email protected] Monojit Choudhury Microsoft Research India Bangalore, India 560080 [email protected] Abstract Every year the Mumbai-based Hindi movie industry, Bollywood, produces sev- eral hundred movies and associated thou- sands of songs that collectively reflect the state and evolution of popular culture and language usage in India over last eight decades. In this paper, we report a corpus linguistic study of Bollywood songs con- ducted through standard statistical meth- ods as well as the more recent techniques of complex network analysis. Our study reveals several interesting features of Bol- lywood lyrics, some of which are strik- ingly similar to and some remarkably dif- ferent from those of the standard natural language text corpora of Hindi. 1 Introduction Bollywood 1 is the term popularly used for the Hindi-language film industry based in Mumbai, India. More than 800 movies are produced out of Bollywood every year 2 . Almost all Bollywood movies feature several songs that are very popular not only in India, but across the globe. In fact, Bol- lywood songs are one of the most searched items on the Web from India 3 . Bollywood songs are generally composed in Hindi, though often the vocabulary and language usage deviate from that of standard Hindi. Ex- tensive use of Urdu and Persian words, composi- tions in various dialects of Hindi (e.g., Braj , Ra- 1 http://en.wikipedia.org/wiki/ Bollywood 2 http://geography.about.com/od/ culturalgeography/a/bollywood.htm 3 http://www.google.com/intl/en/press/ zeitgeist2010/regions/in.html jasthani, Maithili) and Punjabi songs have always been a part and parcel of Bollywood lyrics. In re- cent times, English-Hindi code-mixing is on rise in Bollywood lyrics - a reflection of the ongoing socio-cultural changes in India. Bollywood songs are written for diverse themes such as emotional relationships (of love, betrayal, friendship, parent- child, brother-sister, etc.), social events and func- tions (e.g., marriage, birthday, anniversary), festi- vals and rituals, devotion, songs for children and so on. In this work, for the first time, we present a cor- pus linguistic analysis of Bollywood song lyrics. We crawled the Web to construct a Bollywood lyrics corpus, computed various corpus statistics, and compared them against those for standard text corpus of Hindi. We constructed network mod- els of Bollywood lyrics and standard text corpus, and further compared the topological properties of these networks to understand how similar or dif- ferent is the language of Bollywood lyrics from that of standard Hindi text; and if they are differ- ent, then what are the factors giving rise to these differences. Use of Complex Network Theory (CNT) for this work is motivated by the recent developments in the study of linguistic networks (see (Choud- hury and Mukherjee, 2009) for a review). CNT has provided a suitable framework for modeling and studying complex adaptive systems occur- ring in disciplines as diverse as biology, sociol- ogy, physics, economics and technology. In cor- pus linguistics, network models have been used to explore and explain universal properties as well as unique distinctive features of languages. Our adoption of this framework is strongly driven by previous successful studies of word co-occurrence networks (Cancho and Sol´ e, 2001a). Proceedings of ICON-2011: 9 th International Conference on Natural Language Processing Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2011

Upload: hoangkhue

Post on 11-Feb-2018

214 views

Category:

Documents


1 download

TRANSCRIPT

A Corpus Linguistic Study of Bollywood Song Lyrics in the Framework ofComplex Network Theory

Aseem BehlIIIT-Hyderabad

Hyderabad, India [email protected]

Monojit ChoudhuryMicrosoft Research IndiaBangalore, India 560080

[email protected]

Abstract

Every year the Mumbai-based Hindimovie industry, Bollywood, produces sev-eral hundred movies and associated thou-sands of songs that collectively reflect thestate and evolution of popular culture andlanguage usage in India over last eightdecades. In this paper, we report a corpuslinguistic study of Bollywood songs con-ducted through standard statistical meth-ods as well as the more recent techniquesof complex network analysis. Our studyreveals several interesting features of Bol-lywood lyrics, some of which are strik-ingly similar to and some remarkably dif-ferent from those of the standard naturallanguage text corpora of Hindi.

1 Introduction

Bollywood1 is the term popularly used for theHindi-language film industry based in Mumbai,India. More than 800 movies are produced outof Bollywood every year2. Almost all Bollywoodmovies feature several songs that are very popularnot only in India, but across the globe. In fact, Bol-lywood songs are one of the most searched itemson the Web from India3.

Bollywood songs are generally composed inHindi, though often the vocabulary and languageusage deviate from that of standard Hindi. Ex-tensive use of Urdu and Persian words, composi-tions in various dialects of Hindi (e.g., Braj , Ra-

1http://en.wikipedia.org/wiki/Bollywood

2http://geography.about.com/od/culturalgeography/a/bollywood.htm

3http://www.google.com/intl/en/press/zeitgeist2010/regions/in.html

jasthani, Maithili) and Punjabi songs have alwaysbeen a part and parcel of Bollywood lyrics. In re-cent times, English-Hindi code-mixing is on risein Bollywood lyrics - a reflection of the ongoingsocio-cultural changes in India. Bollywood songsare written for diverse themes such as emotionalrelationships (of love, betrayal, friendship, parent-child, brother-sister, etc.), social events and func-tions (e.g., marriage, birthday, anniversary), festi-vals and rituals, devotion, songs for children andso on.

In this work, for the first time, we present a cor-pus linguistic analysis of Bollywood song lyrics.We crawled the Web to construct a Bollywoodlyrics corpus, computed various corpus statistics,and compared them against those for standard textcorpus of Hindi. We constructed network mod-els of Bollywood lyrics and standard text corpus,and further compared the topological properties ofthese networks to understand how similar or dif-ferent is the language of Bollywood lyrics fromthat of standard Hindi text; and if they are differ-ent, then what are the factors giving rise to thesedifferences.

Use of Complex Network Theory (CNT) forthis work is motivated by the recent developmentsin the study of linguistic networks (see (Choud-hury and Mukherjee, 2009) for a review). CNThas provided a suitable framework for modelingand studying complex adaptive systems occur-ring in disciplines as diverse as biology, sociol-ogy, physics, economics and technology. In cor-pus linguistics, network models have been used toexplore and explain universal properties as wellas unique distinctive features of languages. Ouradoption of this framework is strongly driven byprevious successful studies of word co-occurrencenetworks (Cancho and Sole, 2001a).

Proceedings of ICON-2011: 9th International Conference on Natural Language ProcessingMacmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2011

The rest of this paper is organized as follows.Sec. 2 gives a brief overview of previous workon complex network theory as a tool to under-stand linguistic structure. Sec. 3 describes thelyrics corpus creation and some basic statisticsof the corpus, including word frequency distri-butions. Sec. 4 introduces the network modelsand their topological properties and correspond-ing linguistic interpretations. Sec. 5 summarizesthe salient facts about Bollywood lyrics revealedthrough our analysis, provides plausible socio-cultural explanations for some of these observa-tions, and lists some interesting research questionsfor future work.

2 Corpus Linguistics and ComplexNetworks

Corpus linguistics approaches the study of naturallanguage through corpora and deals with findingpatterns associated with the lexical or grammaticalfeatures of a language (Bennett, 2010). Statisticalproperties such as (a) the frequency of linguisticelements such as morphemes, words and phrasesoccurring in a corpus, (b) the distribution of prob-ability mass across these linguistic elements and(c) the frequencies of co-occurrence of the linguis-tic elements with each other are typically used tocharacterize a corpus (Gries, 2010).

Recently, Complex Networks Theory has beenintroduced as a new tool to model and study thestatistical properties of a corpus (Mehler, 2008).Complex network refers to a system of nodes de-noting physical or abstract entities and a set ofedges connecting them, which usually representcertain interactions between the entities. CNThas been successfully used to model the struc-ture and dynamics of a broad sphere of complexadaptive systems in nature and society; popularexamples include the cell, the Internet, or a net-work of computers. For review of complex net-works see (Albert and Barabasi, 2002; Newman,2003). In linguistics, network models have beenused to study linguistic entities and their interac-tions (Choudhury and Mukherjee, 2009); besidesapplications in NLP, such studies have unearthedfascinating linguistic universals and helped us un-derstand their origin and emergence.

Of special interest to us are the studies thatmodel a corpus as a complex network. Canchoand Sole (2001a) introduced the concept of WordCollocation Networks, where the nodes represent

unique words and edges connect nodes that co-occur in a sentence close to each other. Throughempirical studies of the British National Corpus,the authors showed that these networks exhibitsmall-world property similar to social network ofpeople that is necessary for fast access of words inthe mental lexicon; the networks also feature two-regime power-law which they argued was due tothe existence of distinct kernel and peripheral lex-icon in a language. Since then word collocationnetworks have been used to model corpora in vari-ous languages (see for example (Choudhury et al.,2010)) and various other kinds of datasets, suchas query logs (Roy et al., 2011) and Indus valleysymbols (Sinha et al., 2009) to establish linguisticnature of apparently non-linguistic datasets. AsMehler (2008) has rightly pointed out, “develop-ment of complex text networks provides knowl-edge about constraints of the “nonartificiality” oftext corpora by analogy with Zipf’s law in the caseof lexical systems”. The present study takes in-spirations and ideas from this rich and fascinatingbody of research work.

3 Bollywood Lyrics Corpus

There is no known corpus or database exclusivelybuilt for Bollywood song lyrics. However, being avery popular entertainment industry and one of themost frequently searched item on the web, thereare several managed or user-contributed websitesavailable for Bollywood lyrics. These websitescan be readily crawled to build a Bollywoodsong corpus. In most of these websites, such ashindilyrix.com and lyricsmasti.com, the lyrics arearchived in a non-standard Romanized form in-stead of Devanagari - the script used for writ-ing Hindi. Non-standard Romanization of Hindiwords leads to high degree of spelling variation inthe lyrics that are crawled from two different web-sites. For instance, the song vo phlF bAr jbhm Eml� (movie: Pyaar Mein Kabhi Kabhi) ispresent in one website as Woh Pehli Bar Jab HumMile and in another one as Wo Pahli Baar Jab HamMile. While users can easily read and understandRomanized lyrics, such representations cannot beused for a corpus linguistic analysis of songs, un-less the spellings are normalized to a standard Ro-man form or transliterated back to Devanagari.Both of these approaches, however, would requirea non-trivial amount of research and experimen-tal efforts and can induce several errors during the

process. In order to avoid this problem, we re-stricted ourselves to crawling only those websiteswhere the lyrics have been archived in Devana-gari. Since most of the popular websites and con-sequently the bulk of the song lyrics on the Webare present only in Romanized form, we did missout a large number of potential songs that couldhave enriched the corpus and the subsequent anal-ysis.

There are quite a few websites that archivelyrics in Devenagari for songs ranging from afew hundreds to several thousands. We identi-fied three websites, smriti.com, lyricsindia.net and10lyrics.com, which together cover almost all thesongs for which Devenagari lyrics is available onthe Web. Note that the same song might be presentin more than one website, and sometimes multipleversions of a lyrics might be present even in thesame website (e.g., lyrics of the original song, anda remix version of the song, which are often iden-tical). Therefore, we identified duplicate copiesof the lyrics by a simple word for word compari-son to ensure that only a single version of a songis included in our corpus. This resulted in a cor-pus with 6529 song lyrics and 0.5 million wordsin Devanagari script. Table 1 reports some basicstatistics of the corpus.

Along with the lyrics, the websites also containseveral related facts such as the composer, lyricist,movie-title and year of release of the song. Theseinformation has also been crawled, and stored asmetadata. Amongst these, the date of release is ofspecial interest because it can help us analyse andunderstand the patterns of evolution of Bollywoodlyrics. Such an analysis is beyond the scope of thepresent work.

Table 1: Corpus StatisticsNumber of songs 6529Number of distinct movies 2634Tokens 491KTypes 17.9KTypes after morphological analysis 12.3K

3.1 Morphological Analysis

Hindi morphology is rich and productive. There-fore, a high variation in types in a Hindi corpusmight result from morphological richness ratherthan diversity of topics. While song lyrics may notsport very diverse themes and vocabulary, due to

the presence of morphological forms of only a fewdistinct roots, the corpus might still contain a largenumber of distinct words. In order to understandand separate out such effects, we conducted all ouranalyses on the original corpus where each distinctsurface form is treated as a distinct type, and alsoon a stemmed version of the same corpus, whereonly the root forms and not their morphologicalvariations are considered as distinct types. Weused Hindi morphological analyser developed byAkshara Bharathi Group to derive the morpholog-ical roots of the words (Bharati et al., 1995), whichis approximately 95% accurate. As shown in Ta-ble 1, for 17.9K distinct words there are 12.3K dis-tinct roots, implying that on an average a root has1.5 variations.

3.2 Standard Text Corpus of Hindi

In order to understand how, if at all, the corpuscharacteristics of Bollywood lyrics differ from thatof standard Hindi text, one should repeat the cor-pus analysis studies on standard Hindi text cor-pora of similar size and diversity. For this purposewe sampled a general and two restricted domaincorpora, each containing 0.5 million words fromthe HC Corpora4. The HC Corpora contain Hindilanguage texts collected from publicly accessiblesources like Hindi news websites. We utilized thetopic labels available in HC Corpora to build ourdomain restricted corpora for the domains of Busi-ness & economics and Religion, spirituality & as-trology. One would anticipate that the character-istics of the domain specific corpora will matchthat of the Bollywood lyrics corpora more closelythan those of the general corpus because the vo-cabulary of Bollywood lyrics is rather restricted toonly a certain kind of emotional and socio-culturalthemes.

3.3 Frequency distribution of words

One of the most popular and universally observedcharacteristics of any natural language corpus isZipf’s Law(Zipf, 1949), which says that if thewords in the corpus are sorted by their frequencyof occurrence, then the rank of a word in thissorted list is inversly proportionate to its fre-quency. In other words, if f is the frequency ofa word and r is its rank, then

f = Ar−γ

4http://www.corpora.heliohost.org/

where, γ is close to 1 for all languages, andA is some positive constant of proportional-ity. There have been several empirical studieson Zipfian distribution of natural language text(see http://www.nslij-genetics.org/wli/zipf/ for comprehensive bibliography onZipf’s law.), including some for Indian languagecorpora (Jayaram and Vidya, 2008). Further-more, there have been various mathematical inves-tigations and modeling studies to understand theorigin of Zipfian distribution in natural languagetext (Cancho and Sole, 2002; Kornai, 2002).

Figure 1 plots the rank-frequency distribution ofwords and their morphological roots for the Bolly-wood Lyrics corpus in a doubly logarithmic plot.According to the Zipf’s law, this plot should be astraight line with slope close to 1 (usually between0.9 and 1.1 for all languages), and that is preciselythe case for the standard Hindi text corpus and thetwo domain specific corpora (plots omitted due topaucity of space, but reader can refer to (Jayaramand Vidya, 2008) for similar analysis for Hindi).However, to our surprise the rank-frequency dis-tribution of the words for the Bollywood lyricscorpus can be best approximated by two straightlines in the doubly logarithmic plot instead of one.The best fitting lines for the upper and lower partsof the distributions are also shown in Figure 1,which have been constructed through standard re-gression analysis. The slopes of these lines, whichwe shall refer to as γ1 and γ2 for the left (upper)and the right (lower) regimes respectively are 0.94and 1.68 for the words, and 1.04 and 1.62 for mor-phological roots.

This distribution, which significantly deviatesfrom Zipf’s law, is commonly referred to in theliterature as a two-regime power-law. Note that aspredicted by Zipf’s law, the exponent for the firstregime, γ1 is very close to 1 for both stemmedand unstemmed word distributions. However,γ2 is significantly higher than 1. Interestingly,there are only around 700 to 1000 high frequentroots/words in the first regime which follow Zipf’slaw. The majority of the medium and low fre-quency roots/words comprising 95% of the vocab-ulary significantly deviates from the universallyobservable Zipf’s law.

3.4 Why deviation from Zipf’s Law?

Two regime-power law for rank-frequency dis-tributions was first reported by Cancho and

Sole (2001b). Based on an extensive study con-ducted on a very large English corpus, the authorsobserved that when the corpus size grows beyondcertain limits, a two-regime power-law starts toemerge instead of a single regime Zipfian distri-bution. However, the power-law exponents of thetwo regimes were found to lie between 0.9 and 1.1.Cancho and Sole hypothesized that the two-regimewas an outcome of kernel and periphery distinc-tions in natural language vocabulary, which will bediscussed later. Note, however, that the deviationfrom the Zipf’s law is much more pronounced forLyrics corpus (with γ2 > 1.5) than that reportedin (Cancho and Sole, 2001b). Furthermore, thesize of the lyrics corpus is several orders of mag-nitude smaller than those studied in (Cancho andSole, 2001b).

Mathematical investigations of Zipf’s law haveled to several interesting insights, of which thefollowing is of special significance: It has beenobserved that theoretically a perfect Zipfian dis-tribution is possible only when the vocabularyof a growing corpus is potentially infinite (Kor-nai, 2002). In fact, parallel threads of investiga-tion based on the concept of preferential usage ofwords in the text suggest that if the vocabulary orthe pool of words to choose from is finite, theneven though the rank-frequency distribution mightresemble a Zipfian distribution when the corpus issmall, actually it is better approximated by a β-distribution5. Thus, as the corpus grows beyond alimit the Zipfian nature of the distribution breaksdown and it starts resembling a two-regime power-law when plotted on doubly-logarithmic scale (Pe-ruani et al., 2007).

On the basis of these theoretical results, we ar-gue that the vocabulary of Bollywood lyrics is ex-tremely restricted; this restriction is stronger thaneven that of texts sampled from highly specializeddomains, because the latter does obey Zipf’s law.On the other hand, a large number of Bollywoodsongs are composed every year which is leadingto a steep rise in the corpus size without any sig-nificant rise in the corpus vocabulary. Note thatrepetition of lines and stanzas in songs also addsto the apparent lack of diversity in the vocabularyof the corpus. In absence of any similar studies of

5A random variable is said to have a β-distribution withparameters β > 0 and β > 0 if and only if its probabilitymass function is given by, f(x) = Γ(α+β)

Γ(α)Γ(β)xα−1(1−x)β−1

for 0 < x < 1 and f(x) = 0 otherwise. Γ() is the Eulersgamma function.

Figure 1: Frequency distribution of words in cor-pus.

songs in other languages, we are unable to com-ment whether this property is unique to Bollywoodlyrics or it is a property of lyrics in general.

4 Lyrics Networks: Construction andProperties

In this section we will study the properties ofBollywood lyrics by constructing complex net-work models. We will compute basic topologi-cal properties of the networks and compare thoseagainst that of complex networks built from stan-dard Hindi text corpora. Following some of theprevious studies on network based corpus linguis-tic analysis, here we will construct two basic net-work models. Although we conducted all the net-work analyses on both original and stemmed ver-sions of the corpus, we observe that there are nosignificant qualitative differences between the twoin their network characteristics. Therefore, weshall report the results only for the analyses con-ducted on the surface forms.

4.1 Network ModelsLyric-Word Network (LWNet): We defineLWNet as a bipartite graph G = 〈VL, VW , E〉where VL is the set of nodes of type song lyricsand VW is the set of nodes comprising uniquewords in the lyrics corpus. E is the set of edgesthat run between VL and VW . There is an edgee ∈ E between two nodes vl ∈ VL and vw ∈VW if and only if the word w occurs in the songl. Figure 2 illustrates the nodes and edges ofLWNet. The construction of LWNet is drivenby similar modeling of various complex phenom-ena such as Article-author network, where theedges denote which person has authored whicharticles (Newman, 2001), Movie-actor network,

Figure 2: Illustration of nodes and edges ofLWNet.

Figure 3: A partial illustration of the nodes andedges in WCNL. The labels of the nodes denotewords in the lyrics corpus

where movies and actors comprise the two setsand an edge between them indicates that a par-ticular actor acted in a particular movie (Ram-asco et al., 2004), and Phoneme-Language Net-work where phonemes and languages constitutethe two sets and an edge between them indicatesthat a particular phoneme occurs in a particularlanguage (Mukherjee et al., 2008).

While studying bipartite networks, such as theone defined here, usually the one-mode projectionof the network is also studied, which is defined asthe network formed by the nodes present in onlyone of the partitions and the edges running be-tween two nodes if they are connected to at leastone common node from the other partition in thebipartite network. In the context of LWNet, theone-mode projection onto the word partition leadsto a graph where the set of nodes is VW , and twoword nodes are connected by an edge if they oc-cur in the same song. Since a large number ofunique words (52 on an average) are present in asingle song lyric, every lyric node will thereforetranslate into a clique of size 52 in the one-modeprojection. Such a one-mode projection will be

extremely dense and not of much interest for com-plex network analysis. Therefore, along the linesof some other network-based studies in corpus lin-guistics (Cancho and Sole, 2001a; Choudhury etal., 2010), we study the word co-occurrence net-work of the lyrics corpora.

Word Co-occurrence Network for Lyrics(WCNL): We define WCNL as a network ofwords represented by a graph G = 〈VW , E〉. Anedge e ∈ E represents co-occurrence of two wordsin the same song-lyrics, that is, if they are sepa-rated by zero or one word in a song lyric (Canchoand Sole, 2001a). Restriction of a network is usedto prune the edges which might occur purely bychance (Cancho and Sole, 2001a). In a restrictednetwork, if a pair of words co-occurs less than ex-pected when independence between such wordsis assumed, the edge between the word nodes ispruned out. Figure 3 partially illustrates the nodesand edges of WCNL.

Note that WCNL is a slightly more restricteddefinition for the one-mode projection of LWNet;instead of adding edges between two words which“occurs” in the same lyric, we further restrict thisoccurrence by “co-occurrence within a small win-dow”. Saha Roy et al. (2011) has introducedthe terms global and local co-occurrence respec-tively to denote these two versions of word co-occurrence networks.

4.2 Topological Properties of LWNet

Degree distribution of the nodes in VL: Degreeof a node refers to the number of edges incidenton it. Thus degree of a lyric node in LWNet isthe number of unique words present in the song.The degree distribution for LWNet of the nodesin VL is a plot where the x-axis denotes degreek expressed as a fraction of the maximum degreeand the y-axis denotes pk, the fraction of nodesin VL having degree k. The distribution indicatesthat the number of unique words present in a Bol-lywood song follows a β-distribution. The valuesof α and β obtained using method-of-moments es-timates are 5.1 and 18.9 respectively. The distribu-tion peaks at 44, whereas the mean of the distribu-tion for LWNet is 52.

Degree distribution of the nodes in VW : Fig-ure 4 shows the cumulative degree distribution plotof the nodes in VW in LWNet. In this figure, x-axis represents the degree k and the y-axis repre-sents Pk, where Pk is the fraction of nodes having

degree greater than or equal to k. As is apparentfrom the plot, the cumulative distribution followsa power-law with exponential cut off. The best-fit power-law exponent for the distribution is 1.02.The cumulative degree distribution is more robustto noise and data sparsity than its non-cumulativecounterpart, pk. However, since pk is the negativederivative of Pk, we observe the following rela-tionship between k and pk for partition VW (B isa positive constant of proportionality).

pk = Bk−2.02

Power-law degree distributions are known toemerge in complex networks when the new nodesentering the system connects to the pre-existingnodes through preferential attachment (see (Albertand Barabasi, 2002) for an introduction). For bi-partite networks, Ramasco et al. (2004) proposeda preferential attachment based network growthmodel that can be rephrased in the context ofLWNet as follows: At every time step t, a newlyric node enters the set VL; this node is connectedto n nodes from the VW partition (i.e., n is the av-erage number of unique words per song, which isapproximately 52 for LWNet); out of n nodes orwords, m words are newly introduced in the VWset (in other words, every song introduces m newwords in the song corpus); the remaining n − mwords are chosen from the existing set of wordsin VW through a degree based preferential attach-ment, where the probability of a node to be chosenfor connection is directly proportionate to the cur-rent degree of the node (i.e., the number of songsin which the word already occurs).

Theoretical analysis of this model shows thatthe degree distribution, pk, of the VW partitionin the emergent network will follow a power-lawwith exponent

γ = 2 +m

n− mSince, we observe that γ = 2.02 and know thatn = 52, we can compute m from the above equa-tion which turns out to be equal to 1.02. In otherwords, every new Bollywood song on average in-troduces only 1.02 new words to the corpus. Thisagain points to the limited diversity in the vocabu-lary of the Bollywood songs.

4.3 Topological Properties of WCNLTable 2 lists some of the basic topological char-acteristics of WCNL as well as for the word co-

Figure 4: Degree distribution of LWNet for the setVW in log-log scale.

occurrence networks constructed in similar fash-ion for the standard Hindi corpus and the two do-main restricted corpora. Note that in order to havea meaningful comparison of the topological prop-erties, we ensured that the number of tokens in thestandard text corpora is the same as in the Lyricscorpus. While Table 2 provides a bird’s eye viewof the four networks and their restricted counter-parts, in this subsection we will discuss these topo-logical properties in details and highlight their lin-guistic significance in the context of the currentstudy.

Connection Density: First, we note that eventhough all the networks were constructed fromsimilar size corpora, the number of nodes in thenetworks (N ), i.e., the number of unique wordsin the corpora, is least for Bollywood Lyrics cor-pus. The number of unique words in the gen-eral corpus is almost twice and in the domain spe-cific networks almost 1.5 times that of the lyricscorpus. This further strengthens the fact that vo-cabulary of Bollywood lyrics is very repetitive.The average degree of a node, k, gives an indi-cation of the density of a network. Higher val-ues of k indicates densely connected networks; kclose to N indicates that the network is a cliqueor a complete graph. We observe that for boththe restricted and unrestricted versions of the net-work, the corresponding k is highest for lyrics andlowest for the standard corpus, and the values forthe domain specific corpora are between these twoextremes. High average degree for WCNL indi-cates that even though the vocabulary for Bolly-wood lyrics is restricted, the words are used in var-ious diverse contexts. In other words, lyricists doshow their creativity by putting these words in var-

Figure 5: Cumulative degree distribution ofWCNL in log-log scale.

ious combinations and various orders, even if theyare reluctant to use completely new vocabulary ormetaphors.

Degree Distribution: Figure 5 shows the cu-mulative degree distribution of WCNL, whichseems to be a two-regime power-law. The power-law fits are also shown for better visualization.The power-law exponents γ1 and γ2 for the two-regimes are 0.69 and 1.40, which implies that thecorresponding exponents for the non-cumulativedegree distribution are 1.69 and 2.40 respectively.This values are comparable to those reported byCancho and Sole (2001a), which are 1.5 and 2.7respectively. Interestingly, we do not observe apronounced two-regime power-law for the corre-sponding degree distributions for the standard anddomain specific corpora. The degree distributionsfor those can be best approximated by a singlepower-law with exponential cut-offs. The valuesof the exponents for the cumulative degree distri-bution are reported in Table 2, and the correspond-ing exponents for the non-cumulative distributionslie between 2.18 and 2.27.

Kernel-Peripheral Lexicon: A two-regimepower-law is thought to emerge from presence ofdistinct kernel and peripheral lexicon in a lan-guage. While the kernel lexicon consists of func-tion words and other very commonly used vocabu-lary, the peripheral lexicon consists of domain spe-cific vocabulary that may not be uniformly presentacross all the documents in a corpus. The ab-sence of a two-regime power-law for the stan-dard Hindi corpora is due to the small size ofthe corpora which is not sufficient to bring outthe kernel-periphery distinction. However, for thelyrics corpus, this distinction is prominent, per-

haps due to the limited vocabulary usage of Bol-lywood lyrics which makes it possible to observea word in context a lot more times in the corpusthan for the standard text corpus. As can be ob-served from Figure 1, the kernel lexicon consistsof approximately 1000 words and the remaining17000 words are in the peripheral lexicon givinga kernel-to-periphery ratio of 17 for Bollywoodlyrics. For standard English, this ratio is approx-imately 85 (Cancho and Sole, 2001a). A signif-icantly low kernel-to-periphery ratio again indi-cates lack of diversity in Bollywood lyrics. Byknowing the 1000 words that constitute the ker-nel of Bollywood lyrics, one should be able to“comprehend” most of the Bollywood songs al-most completely, because the peripheral words aremuch fewer and by definition they occur only spo-radically across songs.

Clustering Coefficient: Given a node u, theprobability that two nodes v and w, both of whichare connected to u, are also connected to eachother is known as the clustering coefficient of uor C(u). The average of the C(u)’s of all nodes inthe network is defined as the clustering coefficientof the network orC. As reported in Table 2, all thenetworks have almost identical values of C, whichis around 0.65 for the unrestricted and 0.34 for therestricted versions.

In general we would expect a higher clusteringcoefficient for a network that has higher connec-tion density. One way to factor out the effect of theconnection density on clustering coefficient is tolook at C/Crand instead of C, where Crand is theclustering coefficient of a random graph that hasthe same connection density as the original net-work. Crand is given by the expression k/N (Can-cho and Sole, 2001a). This ratio is around 300,540 and 775 (190, 350 and 485, for restricted ver-sions) respectively for the networks constructedfrom Bollywood lyrics, domain specific text cor-pus and standard text corpus. Thus, even thoughall the networks feature much higher C than theircorresponding random networks, WCNL has a rel-atively lower C to Crand ratio.

Thus, while the connection density of WCNLincreases due to repeated use of the same words invarious novel contexts, the clustering coefficientremains the same as that for the standard corpusnetwork. This apparent fallacy can be understoodas follows: If u (say man) co-occurs with v (saykA) and v with w (say dil), then the co-occurrence

of u with w is typically controlled by the syntacticconstraints imposed on the language (in this case,it is unlikely that man and dil will co-occur in anysong, no matter how many songs are observed),rather than frequencies of occurrence. To summa-rize, clustering coefficient of word co-occurrencenetwork is influenced by the syntactic properties,especially the local or intra-phrasal word orderingconstraints, of a language and not the vocabularyusage. Since the clustering coefficients of all thenetworks are same, we can infer that Bollywoodlyrics feature the same local syntactic constraintsas the standard Hindi language. This is an interest-ing observation because one would normally ex-pect Bollywood lyrics to feature more unrestrictedword ordering, especially because Hindi is a rel-atively free-word order language, and moreover,lyrical constructions in general exploit and breakword ordering restrictions. Nevertheless, our anal-ysis do not support this apparently intuitive fact.

Diameter: Formally, the diameter of a graph isdefined as the longest shortest path in the network.However, it is often approximated by the averageshortest path which is easier to estimate both the-oretically and algorithmically than the true diame-ter. Table 2 reports the average shortest path of thenetworks (d), which all lie between 2.56 and 2.80.Again, we observe that all the networks have verysimilar diameters that are also close to drand - thediameter of the corresponding random graphs withsame connection density.

Small World Property: Three conditions mustbe satisfied in order for a network to be small-world: Sparseness (N >> k), d ≈ drandand C >> Crand (Watts and Strogatz, 1998).All these three properties are satisfied by all thenetworks, implying that word co-occurrence net-works are indeed small-worlds irrespective of thestyle (stories, news reports or lyrics) and do-main (Bollywood, business or religion) of the un-derlying corpora. Small-world nature of wordco-occurrence networks has been associated withfaster access of words from the mental lexi-con (Cancho and Sole, 2001a). Thus, it is a univer-sal property of natural languages and Bollywoodlyrics are no exceptions.

5 Discussion and Conclusion

In this paper, we presented some basic statisticalas well as more in-depth complex network analy-ses of Bollywood song lyrics and standard Hindi

Table 2: Topological characteristics of the Word Co-occurrence Networks of Bollywood Lyrics, standardlanguage corpus and domain specific corpora. ? marked values indicate the power-law exponent of thefirst regime whenever two-regime power law was observed. (Un)Rest: (Un)restricted, Gen: Generalcorpus, Bus: Business corpus, Rel: Religion and Astrology

Network |N | |E| k γ γ2 C Crand C/ d drand×103 ×104 ×10−3 Crand

Unrest. Lyrics 17.9 35.2 39.4 0.69? 1.40 0.66 2.2 300 2.64 2.66Rest. Lyrics 17.9 28.1 31.4 0.74? 1.61 0.32 1.7 188 2.77 2.84Unrest. Gen. 31.1 38.8 24.9 1.20 − 0.62 0.8 775 2.59 3.21Rest. Gen. 31.1 33.7 21.7 1.26 − 0.34 0.7 486 2.75 3.36

Unrest. Bus. 23.2 32.3 27.8 1.19 − 0.64 1.2 533 2.56 3.02Rest. Bus. 23.2 27.1 23.4 1.27 − 0.35 1.0 350 2.74 3.19

Unrest. Rel. 21.8 30.7 28.1 1.18 − 0.65 1.2 542 2.63 2.99Rest. Rel. 21.8 25.6 23.4 1.26 − 0.34 1.0 340 2.80 3.17

text corpora to understand the similar and distinctfeature of the language of Bollywood lyrics ascompared to that of standard Hindi. Some of thesalient facts that came up from our analysis are:

Limited Vocabulary: The vocabulary used forcomposing Bollywood lyrics is small and limited,as compared to standard and even domain specificcorpora. This is evident from the facts that (a)there are only 17000 unique words in 0.5 Millionword lyrics corpus, (b) word frequency distribu-tion significantly deviates from Zipf’s law, and (c)every new lyric adds, on an average, only 1.02 newwords.

Small Kernel-to-periphery ratio: There arearound 1000 words that form the kernel of Bolly-wood lyrics vocabulary. The size of the peripherallexicon is less than 20 times that of the kernel (thisvalue is close to 100 for standard text).

Creative Word Usage: Very high connectiondensity of WCNL implies that the limited vocabu-lary is very aptly used in various combinations toyield interesting and novel compositions.

Syntactic Restrictions: Same clustering coeffi-cient across the networks suggest that the syntacticrestrictions are no more or no less severe in Bolly-wood lyrics as compared to standard Hindi. This iscontrary to the popular belief that poetry tends toenjoy more liberty from the strict rules of syntax.

Small-world Property: All the networks stud-ied are small-worlds, which seems to be a basicproperty of every human language irrespective ofthe style or domain of the corpora.

Thus, most of the differences in the networkproperties of Bollywood lyrics and standard Hinditext seem to be explainable by one key point

that the vocabulary of Bollywood lyrics is muchsmaller than one would expect for a corpus of cre-ative compositions of similar size in Hindi. Whyis the vocabulary of Bollywood lyrics so restrictedand repetitive? This is a very interesting ques-tion whose answer would require a holistic anal-ysis taking into consideration socio-cultural, lin-guistic and historical factors. This is beyond thescope of the current work, though we are temptedto make certain speculations.

Firstly, we note that the themes of the Bol-lywood songs are quite repetitive. Secondly,metaphors used also lack in diversity, some ofthem being derived from literature of the 17thand 18th centuries. For instance, Radha-Krishnametaphor is commonly used in romance and de-votion themes; similarly, description of body partsinvolves a few very standard metaphors. There-fore, we hypothesize that lyricist in particular, andBollywood artists in general have tried to stick toa well tested beaten path instead of exploring glar-ingly new themes and styles. This tendency mighthave emerged due to a strong bias or preference(often modeled as preferential attachment in com-plex network literature) towards reusing artistic orlinguistic structures that have been popular in thepast, downplaying creativity to minimize risks inthe Indian film industry. The alternative, but muchless plausible, hypothesis could be that the con-sumers of this industry themselves do not prefervariation.

Since there are no previous studies on lyrics inother languages to compare our results against, itis hard to comment whether the aforementionedcharacteristics are unique to Bollywood or in gen-

eral true for poetry or movie songs. Therefore, onestarting point for investigating some of these inter-esting questions would be to conduct similar anal-ysis on other poetry and lyrics corpora. Very re-cently there has been a wave of change in themesas well as metaphors of Bollywood lyrics. Sinceour corpus consists of songs mainly till 2005, itis not possible to study these recent trends fromthe current corpus. Nevertheless, it would be in-teresting to study the trends and dynamics of wordusage in Bollywood lyrics over last 80 years.

Acknowledgement

This work was conceived as a project during theIASNLP-2011 summer school at IIIT Hyderabad.Therefore, we would like to thank the organizersof IASNLP-2011. Also, thanks to Kanika Guptafor her help in crawling and de-duplication of Bol-lywood lyrics.

ReferencesAlbert, R. and Barabasi, A.- L. 2002. Statistical me-

chanics of complex networks. Reviews of ModernPhysics 74, 47–97.

Gena R. Bennett 2010. Using Corpora in the Lan-guage Learning Classroom: Corpus Linguistics forTeachers. Michigan ELT.

Akshar Bharati, Vineet Chaitanya and Rajeev Sangal1995. Words and their Analyzer. Natural LanguageProcessing : A Paninian Perspective, Prentice Hallof India.

Monojit Choudhury, Diptesh Chatterjee, and Ani-mesh Mukherjee 2010. Global topology of wordco-occurrence networks: Beyond the two-regimepower-law. in Proceedings of Coling 2010.

Monojit Choudhury and Animesh Mukherjee. 2009.The structure and dynamics of linguistic networks.In N. Bellomo,N. Ganguly, A. Deutsch, and A.Mukherjee, editors, Dynamics On and Of Com-plex Networks, Modeling and Simulation in Sci-ence, Engineering and Technology, pages 145–166.Birkhauser Boston.

Ferrer-i-Cancho, R. and Sole, R.V. 2002. Zipf’s lawand random texts. Advances in Complex Systems,Vol. 5, No. 1 (2002) 1-6

Ferrer-i-Cancho, R. and Sole, R.V. 2001a. The smallworld of human language. Proceedings of the RoyalSociety of London B, 268(1482):2261-2265.

Ferrer-i-Cancho, R. and Sole, R.V. 2001b. Tworegimes in the frequency of words and the origin ofcomplex lexicons: Zipf’s law revisited. Journal ofQuantitative Linguistics, 8:165-173.

Gries, S. T. 2010. In A. Sanchez and M. Almela(Eds.) A mosaic of corpus linguistics: Selectedapproaches (pp. 269-291). Germany: Peter Lang,Frankfurt am Main

Jayaram, B. D., Vidya, M. N. 2008. Zipf’s Law forIndian Languages. Journal of Quantitative Linguis-tics, Vol. 15, No. 4., pp. 293-317.

Andras Kornai. 2002. How many words are there?Glottometrics 4, 61-86.

Animesh Mukherjee, Monojit Choudhury, AnupamBasu, and Niloy Ganguly 2008. Modeling theStructure and Dynamics of the Consonant Invento-ries: A Complex Network Approach. in Proceed-ings of Coling 2008.

Mehler, A. 2008. Large text networks as an objectof corpus linguistic studies. Corpus Linguistics. AnInternational Handbook of the Science of Languageand Society In A. Ludeling and M. Kyto (Eds.), pp.328-382. Berlin/New York: De Gruyter.

Newman, M. E. J. 2003. The structure and functionof complex networks. SIAM Review 45, 167–256.

Newman, M. E. J. 2001. Scientific collaboration net-works. Phys. Rev. E 64

Fernando Peruani, Monojit Choudhury, AnimeshMukherjee, and Niloy Ganguly. 2007. Emergenceof non-scaling degree distribution in bipartite net-works: a numerical and analytical study. Euro-physics Letters, vol. 79, no. 28001.

Ramasco, J. J., Dorogovtsev, S. N. and Pastor-Satorras,R. 2004. Self-organization of collaboration net-works. Physical Review E, 70(036106).

Rishiraj Saha Roy, Niloy Ganguly, M. Choudhury andA. Mukherjee 2011. Complex Network AnalysisReveals Kernel-Periphery Structure in Web SearchQueries. SIGIR.

Sitabhra Sinha, Raj Kumar Pan, Nisha Yadav, MayankVahia, Iravatham Mahadevan. 2009. Network anal-ysis reveals structure indicative of syntax in the cor-pus of undeciphered Indus civilization inscriptions.in Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing,ACL-IJCNLP 2009, pages 5-13

Watts, D. J. and Strogatz, S. H.. 1998. Collec-tive dynamics of ‘small-world’ networks. Nature393(6684):440-442

George K. Zipf 1949. Human Behavior and the Prin-ciple of Least Effort. Addison-Wesley.