potholes and bad road conditions- mining twitter to ... · transport and highways3 (morth),...

11
Potholes and Bad Road Conditions- Mining Twier to Extract Information on Killer Roads Swati Agarwal Birla Institute of Technology and Science (BITS), Pilani Goa, India [email protected] Nitish Mittal noon.com Dubai, UAE [email protected] Ashish Sureka Ashoka University Haryana, India [email protected] ABSTRACT Research shows that Twitter is being used as a platform to not only share and disseminate the information but also collecting com- plaints from citizens. However, due to the presence of high volume and large stream data, real-time manual identification of those com- plaints is overwhelmingly impractical. In this paper, we identify the complaints and grievances posted on bad road conditions causing life risks, discomfort and poor road experience to the citizens. We formulate the problem of killer road complaints identification as a multi-class text classification problem. We address the challenge of keyword based flagging methods and identify several linguistic features that are unique for the killer road complaints tweets such as the issue reported in the complaint, pinpoint location of the issue, city or region location information. Our results reveal that not all complaint reports posted to Public agencies contain the sufficient information and are not useful. Therefore, we further propose a mechanism to enrich the nearly-useful tweets and convert them into useful reports. We present our results using information visualization and gain actionable insights from them. Our results show that the proposed features are discriminatory and able to classify killer roads complaints with an accuracy of 67% and a recall of 65%. KEYWORDS Bad Roads Complaints, Information Visualization, Lexical Knowl- edgeBase, Named Entities, Natural Language Processing, User Gen- erated Data 1 INTRODUCTION Bad road conditions such as road irregularities, roughness, pot- holes, bumps, patchy surface and poorly designed speed breakers make the road risky and hazardous for drivers resulting in accidents 1 . Similarly, several crashes happen due to blind or improper curves, temporary diversions, traffic on under repair and under construction roads. Dysfunctional or dim street lights, encroachments on roads by shops, hawkers and broken or hard to see road signboards are also major causes of accidents 2 . Bad road conditions based (the focus of the work presented in this paper) causes are different than acci- dents happening due to indiscipline or careless driving (driver’s fault such as speeding or drunken driving) or poor weather conditions (rain, storm, and snow). As per the statistics of Ministry of Road Transport and Highways 3 (MORTH), Government of India, on an average 400 road deaths take place every day in India 4 . According to 1 http://www.thehindu.com/news/cities/bangalore 2 http://indianexpress.com/article/cities/pune/ 3 http://morth.nic.in 4 Accidents statistics in India MORTH statistics, in 2015, India had 501, 423 road accidents caus- ing 146, 133 deaths while 10, 727 people were killed in the crashes happened because of potholes. According to the reported accidents and newspaper statistics, Delhi recorded the highest number of road accident deaths (17 deaths per hour) in India in 2015 5 . In the first quarter of 2016, over 300 deaths were reported in Uttarakhand state where accidents happened due to the sharp turns and absence of crash barriers. While, in Maharashtra, Meghalaya, Madhya Pradesh and Uttar Pradesh state, the maximum number of deaths happened due to the potholes and under-repaired roads and highways 6 . Previous studies shows that due to the increasing trend of adap- tation of social media by Indian Government and public agencies to reach out to the people, public citizens use Twitter to post their complaints and report the incidents to the concerned authorities [11][7][1]. The complaints on killer roads contain the information about road irregularities and other issues causing high risks and discomfort to the citizens. Figure 1 shows concrete examples of bad road conditions complaints reported to the Indian Government’s official Twitter handle. Identification of such complaints reports on Twitter and resolving them in a timely manner is an important problem for the government and concerned authorities. However, the manual analysis of such complaints and gaining insights is over- whelmingly impractical due to the high volume and massive size of the data. Twitter statistics reveal that official twitter handle of gov- ernment authorities and ministry of road, transport and highways in India receives an approximate of 6-10 complaint tweets per minute. Further, despite several efforts made by the government, in order to identify such complaints manually, many reports remain unad- dressed. Kumar et al. [9] propose an approach to mine Twitter users’ online communication for identifying potential road safety hazards that pose driving risks. Gu et al. [6] propose a methodology to extract incident information on both highways and arterial roads of Pitts- burgh and Philadelphia Metropolitan Areas. Fu et al. [5] describe an approach to extract and analyze real-time traffic related twitter data for incident management purpose. present a real-time moni- toring system for detecting and analyzing the traffic incidents and congestion via Twitter communications. Schulz et al. [13] present a solution for allowing to increase the situational awareness by real- time identification of small-scale incidents. Panagiotopoulos et al. [12] present a study on examining the use of Twitter for making public awareness during emergencies such as heavy snow and public riots. 5 http://auto.ndtv.com/news/road-traffic-deaths-highest-in-delhi-1417473 6 http://timesofindia.indiatimes.com/india/400-road-deaths-per-day-in-India-up-5-to-1-46-lakh-in-2015/ articleshow/51919213.cms

Upload: trinhdang

Post on 10-May-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

Potholes and Bad Road Conditions- Mining Twitter to ExtractInformation on Killer Roads

Swati AgarwalBirla Institute of Technology and

Science (BITS), PilaniGoa, India

[email protected]

Nitish Mittalnoon.com

Dubai, [email protected]

Ashish SurekaAshoka University

Haryana, [email protected]

ABSTRACTResearch shows that Twitter is being used as a platform to not

only share and disseminate the information but also collecting com-plaints from citizens. However, due to the presence of high volumeand large stream data, real-time manual identification of those com-plaints is overwhelmingly impractical. In this paper, we identify thecomplaints and grievances posted on bad road conditions causinglife risks, discomfort and poor road experience to the citizens. Weformulate the problem of killer road complaints identification as amulti-class text classification problem. We address the challengeof keyword based flagging methods and identify several linguisticfeatures that are unique for the killer road complaints tweets suchas the issue reported in the complaint, pinpoint location of the issue,city or region location information. Our results reveal that not allcomplaint reports posted to Public agencies contain the sufficientinformation and are not useful. Therefore, we further propose amechanism to enrich the nearly-useful tweets and convert them intouseful reports. We present our results using information visualizationand gain actionable insights from them. Our results show that theproposed features are discriminatory and able to classify killer roadscomplaints with an accuracy of 67% and a recall of 65%.

KEYWORDSBad Roads Complaints, Information Visualization, Lexical Knowl-

edgeBase, Named Entities, Natural Language Processing, User Gen-erated Data

1 INTRODUCTIONBad road conditions such as road irregularities, roughness, pot-

holes, bumps, patchy surface and poorly designed speed breakersmake the road risky and hazardous for drivers resulting in accidents1. Similarly, several crashes happen due to blind or improper curves,temporary diversions, traffic on under repair and under constructionroads. Dysfunctional or dim street lights, encroachments on roads byshops, hawkers and broken or hard to see road signboards are alsomajor causes of accidents 2. Bad road conditions based (the focusof the work presented in this paper) causes are different than acci-dents happening due to indiscipline or careless driving (driver’s faultsuch as speeding or drunken driving) or poor weather conditions(rain, storm, and snow). As per the statistics of Ministry of RoadTransport and Highways3 (MORTH), Government of India, on anaverage 400 road deaths take place every day in India4. According to1http://www.thehindu.com/news/cities/bangalore2http://indianexpress.com/article/cities/pune/3http://morth.nic.in4Accidents statistics in India

MORTH statistics, in 2015, India had 501,423 road accidents caus-ing 146,133 deaths while 10,727 people were killed in the crasheshappened because of potholes. According to the reported accidentsand newspaper statistics, Delhi recorded the highest number of roadaccident deaths (17 deaths per hour) in India in 20155. In the firstquarter of 2016, over 300 deaths were reported in Uttarakhand statewhere accidents happened due to the sharp turns and absence ofcrash barriers. While, in Maharashtra, Meghalaya, Madhya Pradeshand Uttar Pradesh state, the maximum number of deaths happeneddue to the potholes and under-repaired roads and highways6.

Previous studies shows that due to the increasing trend of adap-tation of social media by Indian Government and public agenciesto reach out to the people, public citizens use Twitter to post theircomplaints and report the incidents to the concerned authorities[11][7][1]. The complaints on killer roads contain the informationabout road irregularities and other issues causing high risks anddiscomfort to the citizens. Figure 1 shows concrete examples ofbad road conditions complaints reported to the Indian Government’sofficial Twitter handle. Identification of such complaints reportson Twitter and resolving them in a timely manner is an importantproblem for the government and concerned authorities. However,the manual analysis of such complaints and gaining insights is over-whelmingly impractical due to the high volume and massive size ofthe data. Twitter statistics reveal that official twitter handle of gov-ernment authorities and ministry of road, transport and highways inIndia receives an approximate of 6-10 complaint tweets per minute.Further, despite several efforts made by the government, in orderto identify such complaints manually, many reports remain unad-dressed. Kumar et al. [9] propose an approach to mine Twitter users’online communication for identifying potential road safety hazardsthat pose driving risks. Gu et al. [6] propose a methodology to extractincident information on both highways and arterial roads of Pitts-burgh and Philadelphia Metropolitan Areas. Fu et al. [5] describean approach to extract and analyze real-time traffic related twitterdata for incident management purpose. present a real-time moni-toring system for detecting and analyzing the traffic incidents andcongestion via Twitter communications. Schulz et al. [13] present asolution for allowing to increase the situational awareness by real-time identification of small-scale incidents. Panagiotopoulos et al.[12] present a study on examining the use of Twitter for makingpublic awareness during emergencies such as heavy snow and publicriots.

5http://auto.ndtv.com/news/road-traffic-deaths-highest-in-delhi-14174736http://timesofindia.indiatimes.com/india/400-road-deaths-per-day-in-India-up-5-to-1-46-lakh-in-2015/articleshow/51919213.cms

Page 2: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

Figure 1: Concrete Examples of Complaints on Killer Roads and Citizens’ Discomfort Reported to Official Twitter handle of Ministryof Road, Transport and Highways, Government of India- Addressing Various Road and Transport Related Issues such as Pothole,Dysfunctional Streetlights, Unplanned Construction on Highway and Breakers near Intersection and Sharp Turns.

Motivation: The research presented in this paper is motivatedby the existing literature and a need to develop a system that canautomatically identify complaint reports and overcome the challengeof manual inspection. The key motivation of this paper is to extractimportant information from tweets such as the exact geographicallocation which can be used to locate the fault and present the infor-mation to the end-users of such applications (government agenciesresponsible for monitoring and repairing roads). However, the auto-mated identification and analysis of such complaints is a challengingproblem. Due to the presence of free-form text tweets do not havea defined structure or language format and hence are high likely tohave grammar and spell errors. Further, the presence of multilingualtexts and scripts in tweets, it is challenging to identify the linguisticfeatures for building NLP based applications. Text classificationor categorization and information extraction from tweets is thus atechnically challenging problem. The specific research aim of thework presented in this paper is to investigate rule-based and machine-learning based text classifiers to identify bad road conditions reports.Our research aim is also to investigate information extraction andvisualization to extract useful information and insights from thecomplaint reports.

2 RESEARCH CONTRIBUTIONS:We conduct a literature survey in the area of mining Twitter

data for identifying complaints and reports on bad road conditions.Based on our survey, we observe that most existing studies i) focuson mining users’ online communication on social media and newswebsites [9] [6], ii) building a prediction based model and situationawareness model for identifying real-time emergency situations such

as heavy snowfall [12] and traffic issues [4] [14] [10]. However,in this paper, we mine the public citizens’ complaints reported onTwitter and extract the insights and information that are useful for thegovernment. In contrast to the existing studies, the specific and keycontributions of the study presented in this paper are the followings:

(1) To the best of our knowledge, the study presented in thispaper is the first study on mining citizen’s complaints andreports on killer roads posted on official Twitter handle ofGovernment.

(2) We investigate the efficacy of spatial, contextual and linguisticfeatures for identifying useful information from complaintposts. We build a text-analysis based model to enrich spatialfeatures (geographical location metadata) in a tweet that canbe used to discover insights from less informative reports.We propose a content-disambiguation model to enrich ourlinguistic features for identifying exact problem reported inthe tweets.

(3) We publish the first database of citizens’ complaints on killerroads and highways reported to official public agencies onTwitter. We make our dataset [2] publicly available for theresearch community so that our results can be used for bench-marking, comparison and further extension.

We expect our results to be useful for the local government (Ministryof Road, Transport, and Highways) and concerned authorities (road,highway and traffic engineers) interested in identifying and fixing thecauses of road accidents. Our analysis might help public agenciesto perform data analytics on killer road reports and identify thegeographical locations and issues in a time-efficient manner.

2

Page 3: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

Table 1: Statistics of Experimental Dataset- Illustrate the num-ber of Original and Unique tweets Collected After Data Ex-traction, Filtering and Sampling. Showing the Number of Sam-pled Tweets Consisting of Contextual Metadata (Image, Video,URL, Hashtag, @username mention). TS denotes the Times-tamp (dd/mm) of the Tweet.

Collection

Total Original Re-tweet Replied81,304 17,511 52,701 11,092English Unique Sampled Users13,368 13,208 3,302 2,604

Tweets

Image Video URL Hashtag598 0 854 821@mention Lat-Long Min TS Max TS1673 8 20/07 Max 13/09

3 EXPERIMENTAL SETUP3.1 Dataset Collection and Characterization

In the work presented in this paper, we conduct our experimentson an open source Twitter dataset collected in real-time. We iden-tify two official Twitter handle of concerned departments of IndianGovernment. @nitin_gadkari is the official account of Mr. NitinGadkari- current Union Minister of Road Transport & Highwaysand Shipping in India. @MORTHIndia is the official Twitter accountof Ministry of Road Transport and Highways, Government of India.We use official Twitter REST API7 and search the tweets mentioningthe screen name (along with @ symbol) of these accounts. TwitterREST API allows us to extract only past 7 days of data. Therefore,we collect the tweets posted on these accounts in real time for 8weeks (from July 18, 2016 till September 13, 2016). Using TwitterAPI, we were able to collect a total of 81,304 tweets consisting of17,511 original tweets, 11,092 replied tweets and 52,701 re-tweets.In the work presented in this paper, we conduct our experiment onlyon English-language tweets. Therefore, we use "detect language"8

feature of Twitter API and identify the non-English-language tweets.Further, the aim of our study is to identify linguistic features ofcomplaint reports and build a text classifier. Therefore, we also fil-ter the tweets consisting of only images, videos, and URLs and notext- identified as undefined (und) using detect language feature ofTwitter API. Table 1 shows the statistics of tweets remaining afterfiltering the non-English and unidentified language posts. In order toremove the bias from our data, we filter the duplicate records andperform a random sampling on our data to select a subset of tweetsfor our experiments. Table 1 reveals that after filtering and randomsampling (25% of unique tweets) of records, there are 3,302 uniqueand English-language tweets posted by 2,604 unique users.

Multi-media Content: We observe that users embed externalentities in their tweets to provide more details about the issue re-ported in their complaints. Therefore, in addition to extracting thetextual content of tweets, we also download the available contex-tual metadata using Twitter REST API. We extract the availablehashtags, @username mentions and count of URLs, images, andvideos present in the tweet. We also extract the geo-location (latitude

7https://dev.twitter.com/rest/reference/get/search/tweets8https://dev.twitter.com/rest/reference/get/help/languages

and longitude) of the users if enabled by the user while posting thetweet. Table 1 shows the number of tweets (after random sampling)consisting of various distinct contextual metadata. Table 1 revealsthat only 18% (598 out of 3,302) of the tweets in our experimentaldataset contain external images and 25% (821 out of 3,302) of thetweets contain distinct hashtags while only 8 out of 3,302 tweetscontain geo-location information of the users posting these tweets.

3.2 Dataset Enhancement and EnrichmentIn this Section, we preprocess the sampled dataset and address

the challenge of noisy content in the tweets. We use the micropostenrichment algorithm proposed in our previous study Mittal et al.[11]. The proposed algorithm performs a syntactic enhancement ofthe tweets and consists of five phases primarily named as hashtagexpansion, @username mention expansion, spell error correction,acronym and slang treatment and sentence segmentation. In thispaper, we address the limitations of the previously proposed methodand improve the accuracy of enrichment by varying the ordering ofphases and number of iterations of some specific phases.

1. Sentence Segmentation (SSN): In the first phase of the mi-cropost enrichment model, we enrich the syntactic structure of atweet. We remove the filler terms (such as umm, hmm, errr) from thetweets. We also replace special characters appearing consecutivelywith one character. For example, ????? and !!!! are replaced with? and ! respectively. Twitter does not identify the hashtags writtenconsecutively or without spaces before "#". For example, a patternof hashtags "#DwarakaExpressWay#DelhiRoads#Potholes", Twitterdoes not recognize any of the hashtags. We perform a cleaning onthe extracted tweets and add a padding whitespace before every #and later trim all extra whitespaces occurring consecutively.

2. Hashtag Expansion (HTE): In hashtag expansion, we expandthe joint hashtags used as a descriptive hashtag created by combiningmore than one words (alphanumeric strings or special characters).We use the method proposed in our previous study Mittal et al. [11]and split the long hashtag into disjoint words. For example, "#Mum-baiPotholes", and #WakeupCM are split into "Mumbai Potholes",and "Wake up CM" respectively.

3. Padding Space Correction: Similar to the addition of spacepadding before hashtag (#), we add a whitespace before and after allspecial characters (comma, period, question mark, colon, semicolon,underscore and exclamatory marks). We further trim the consecu-tive extra whitespaces, fixing the presence and absence of spacesbefore and after the special characters. We, however, do not addwhitespaces before the special characters appearing in @usernamementions. For example, @poonam_mahajan.

4. Spell Error Correction (SEC): We pre-process each tweet inour experimental dataset and correct the possible spell errors. Weuse the application of Bing Search Engine AP for spell correctionand extract the n-grams resulted as "Including results for". For agiven sentence, ’pleas answr my query asap’, we create a set of three3-grams [n1: ’pleas answr my’, n2: ’answr my query’, n3: ’my queryasap’]. Here, ’answr’ is replaced by ’answer’ only if both n1 and n2

3

Page 4: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

Table 2: Concrete Examples of Tweets Recorded Before and After Executing the Micropost-Enrichment Algorithm Consisting of FivePhases (not in sequence)- User Mention Expansion (UME), Hashtag Expansion (HTE), Spell Error Correction (SEC), Acronym andSlang Treatment (AST) and Sentence Segmentation (SSN)

Before AfterSSN @AamAadmiParty Umm nice.What about Vadra, Khurshid

, Sheila Dixit ... LoL. Where is the 370 pages report???@nitin_gadkari

@AamAadmiParty nice. What about Vadra, Khurshid, SheilaDixit. LoL. Where is the 370 pages report? @nitin_gadkari@ArvindKejriwal

HTE #InfrastructureMatters: @Nitin_Gadkari approves of #Delhi-Amritsar e-way! #KnowMore https://goo.gl/Lk5uzn

Infrastructure Matters: @Nitin_Gadkari approves of Delhi Am-ritsar e-way! Know More https://goo.gl/Lk5uzn

SEC It bcms vry dffclt 2 walk on #Delhi roads.Parnts hv gvnvhicls 2 their kids who knws nthn.2de I savd myslf 4 tims@nitin_gadkari @MORTHIndia

It becomes very difficult 2 walk on #Delhi roads. Parents havegiven vehicles 2 their kids who knows nothing. 2de I savedmyself 4 times @nitin_gadkari @MORTHIndia

AST @sureshpprabhu learn 4m shipping where 1 accident y navigatorleads 2 life time ban & cancellation of licence coordinate wid@nitin_gadkari

@sureshpprabhu learn from shipping where 1 accident why nav-igator leads 2 life time ban & cancellation of license coordinatewith @nitin_gadkari

UME @MORTHIndia @nitin_gadkari @PMOIndia @naren-dramodi pls take appropriate action. I ve raised so many com-plains. No response whatsoever

MORTHINDIA Nitin Gadkari PMO India Narendra Modi plstake appropriate action. I ve raised so many complains. No re-sponse whatsoever

are corrected as ’please answer my’ and ’answer my query’.

5. Acronyms and Slang Treatment (AST): We expand the acronymsand slang present in our tweets in three stages: domain specific slang,standard slang, and user slang. In order to avoid the risk of noisein pre-processing, the algorithm proposed in Mittal et al. takes aninput of domain specific acronyms. We do a manual identificationon Twitter and identify several acronyms and slang commonly usedin road and transport related tweets. For example, NH (NationalHighway), Toll, RTO (Regional Transport Office), RC (Vehicle Reg-istration Certificate). If a term does not exist in domain specificslang keywords then we replace it with user slang. User slang are theslang that are commonly used on social media but are not presentor recognized as standard slang used for those words. For example,PL is used for please whereas, it is defined as "parent looking" in astandard slang dictionary. We identify the 1, 2 and 3 n-grams presentin the dataset and create an exhaustive list of such slang. In the thirdstage of slang and acronyms treatment, we replace the remainingslang with their standard definitions. We, however, do not replaceany numeric character unless it is an alphanumeric string. For exam-ple, semantically, 4 can mean either word "for" or "four" while "r8"is used for "right" and "b4" is used for "before".

6. Username Expansion (UME): Expansion of direct mentionsreplaces the twitter @screenname with the user profile name makingthem easily recognizable by the named entity recognizers. For exam-ple, @Dev_Fadnavis, and @Ra_THORe are expanded to DevendraFadnavis and Rajyavardhan Rathore.

In order to minimize the noise in output tweet, we apply domainspecific slang conversion after every other phase of data enrichment.We observe that applying spell correction phase directly after hashtag

expansion can rather create the noise in the data. For example, a hash-tag #CMODelhi is expanded as "CMO Delhi" corrected as "COMDelhi" while CMO is an abbreviation used for "Chief Minister of".We use enriched tweets for feature extraction and classification andcall them as Experimental Dataset 1 (ED1). Table 2 shows examplesof killer road complaints tweets recorded before and after executingeach phase of micropost enrichment algorithm.

4 FILTERING NON-COMPLAINT TWEETSAs discussed in Section 1, the official public service agencies

account are open and anyone can mention them in their tweets. Weobserve that not all tweets posted on these accounts are complaintreports and rather are either off-topic or discussions not relevant tothe complaint department. In order to classify a complaint reporton bad road conditions, dysfunctional facilities or irregularities, weidentify various features that are a strong indicator of a tweet to bea complaint report. However, we also observe that there are severaldiscriminatory features that are a strong indicator of a tweet cer-tainly not to be complaint report. Based on our manual inspectionon official Twitter handle and non-complaint tweets identificationmodel proposed in our previous study [11], we divide such tweetsinto 4 categories (AISP): Appreciation posts, Information Sharingand Promotional tweets.

Appreciation tweets are the posts made by citizens for praisingthe government for their work or resolving their previous complaints.Information sharing tweets are the tweets posted by users to sharedaily news about the events or policies initiated by the government.Further, due to the presence of several official accounts on Twitter,we categorize tweets posted by a different official account of samepublic agencies categorized as promotional tweets. Table 3 shows ex-amples of appreciation, information sharing, and promotional tweetsposted on official accounts of @nitin_gadkari and @MORTHIndia.

4

Page 5: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

ED2

Place

Person

Feature Extraction

Rule Based Method

Classification Irrelevant

Useful Tweets (ED4) Incomplete Reports (ED3)

ED2 Geographical

Locations (GL)

Sc>th1 Sc>th2

Place � person

Indico API- NER

OSM API

City State Landmark

Figure 2: Proposed Framework for Identifying the Geographical Location

Table 3: Concrete Examples of Non-Complaint Tweets- Clas-sified into Four Categories: Appreciation (APP), InformationSharing (IS) and Promotional (PRL) Tweets.

AISP Tweet

APP Big big move @mansukhmandviya ji. Thank you forpersonally monitoring this. Thanks to @MORTHIndiaand @nitin_gadkari ji also for this

IS Commuters, get ready for more sun-kissed rideson the waters https://goo.gl/SlrCWE @PiyushGoyal@nitin_gadkari

PRL Shri @nitin_gadkari lays foundation stones for 12 Na-tional Highways projects. #TransformingIndia

We use the AISP tweet classifier method proposed in our previousstudy [11] and classify AISP tweets and filter the unknown poststhat may or may not be a complaint report about killer roads. We usethese unknown tweets for further feature extraction and classificationand call them as Experimental Dataset 2 (ED2).

5 FEATURES EXTRACTION ANDSELECTION

We observe that all the complaints that are reported do not containnecessary and sufficient information about the issue faced by the peo-ple and hence are left unaddressed or unresolved despite consideredby the authorities. Another fact that we come across is that unlikeother department complaints, killer road related complaints can havefewer chances to get addressed based on the amount of informationavailable in the tweets. For example, in a railway department, thefollowing two complaints "no blankets are available in S3 coach ofHemkund Express." and "Boarded Hemkund Express from HaridwarJunction. No blankets are available on the train. My PNR numberis 12345" have similar possibility to get addressed. Whereas, in thecomplaints related to killer roads, a tweet "under-construction roadsnear Nehru Place metro station are causing accidents." is high likelyto get addressed in comparison to the tweet "Nehru place roads areaccident prone. Delhi roads are worst". We create a hypothesis thatin order to successfully address a complaint about killer roads it isimportant to identify the three features:

(1) to know the type of issue faced by the citizens so that thecomplaint can be forwarded to the concerned department

(2) it is required to know the region (or city) from which thecomplaint is reported

(3) the exact pinpoint location (or landmark) of the problem

We discuss each of these features and the proposed method for theiridentification in the following subsections:

5.1 Geographical Location IdentificationTwitter microposts are the user-generated data and do not have

a defined structure for writing. Therefore, despite correcting thespell errors and performing sentence segmentation on tweets, wefind that existing named entity recognizers are not able to identifylocations in tweets with 100% accuracy. Further, we find that inIndian locations scenario, the majority of locations are identified aspersons’ names. For example, for the locations "Mahatma GandhiMarg", "Vasant Vihar Colony", "PVR Saket"- "Gandhi", "Vasant"and "Saket" are identified as persons’ names. Therefore, we usea combination of Named Entity Recognizer and GeoCoding APIto extract the geographical entities mentioned in tweets. Figure 2demonstrates the high-level framework for identifying a geographi-cal location in tweets.

We use Indico Text Analysis API9 to extract the specific placesand person referred in the tweets. Test Analysis model of Indico APIallows a machine to discover a variety of knowledge from plain textinput using Machine Learning and Deep Learning methods. Given atweet ti ∈ ED2 as plain text input, Indico API returns a list of JSONobject of three key-value pairs. Each object represents a substring oftweet ti- identified as a person or place, prediction confidence scoreand position of substring (starting and ending character) in the inputtweet. The confidence score varies between 0 to 1 representing theconfidence of text analysis model for identifying it to be a place orperson.

Using Indico API, for a given tweet input ti, we extract allthe locations L = {L1,L2, ....,Li | 0 ≤ i ≤ lengthti} and personsP = {P1,P2, ....,Pj | 0 ≤ j ≤ lengthti}. Based on the score of ver-ified locations, we divide the confidence score CS of locationsinto three categories- low (CS < 0.2), medium (0.2 ≤ CS < 0.5)and high (0.5 < CS). For each location Li ∈ L with a confidencescore CSLi > 0.5 and 0.2 ≤CSLi < 0.5, we filter them as LHigh andLMedium respectively. While we discard the entities identified as alocation with a confidence score below 0.2. Table 4 shows examplesof tweets and the confidence score of entities to be identified as lo-cation. As discussed above, due to the ambiguity in location names,we observe that many places are identified as persons and vice versa.Therefore, we discard the places having a medium confidence scorebut also identified as person names. The extracted geographical loca-tion entities for a tweet ti are GL = LHigh ∪

(LMedium −P

). Further,

if GLm ∈ GL is a substring of GLn ∈ GL then we discard GLm and

9https://indico.io

5

Page 6: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

Core NLP

ED2

Place

Person

Feature Extraction

Rule Based Method

Classification Irrelevant

Useful Tweets (ED4) Incomplete Reports (ED3)

ED2 Geographical

Locations (GL)

Sc>th1 Sc>th2

Place � person

Indico API- NER

OSM API

City State Landmark

Noun ED2

POS Tagging

Adjective/ Adverb

Lexical Database

Semantic Similarity

High Road Related

Terms

High

Low

Convert Noun in Adjective form

Related

Related

Discard

Core NLP ED2

POS Tagging

Adjective/ Adverb Lexical

Knowledgebase

Semantic Similarity

High

Road Related Distinct Terms

High

Low

Convert Noun in Adjective form

Related

Related

Discard Noun

<W1, W2, W3, W4,….., W8> C1: W1 W2 W3…|| C2: W1 W4 W5… C3: W3 W5 W6…|| C4: W5 W7 W8…

Figure 3: High-Level Diagram of the Proposed Framework for Mining Tweets for Identifying Issues Related to Killer Roads.

Table 4: Concrete Examples of Location and Person Names Identified using Indico Text Analysis Model- Illustrates the Entity andConfidence Score to be a Place or Person’s Name

Tweet Place PersonAccident at Wadkhal. Pls confirm if traveling towards Alibaug. Truckbroke into 2 due to pothole.

<Wadkhal:0.73, Alibaug:0.71> -

Pathetic highway since ages between Roorkee to haridwar. Corruptministry. I bet if you drive on own

<haridwar:0.95, Roorkee:0.60, Pathetichighway:0.03>

<Roorkee:0.06>

Table 5: List of All Map Features and Amenities Identified byOpenStreetMap API for the Tweets Present in Our Experimen-tal Dataset and Labeled as Landmark

Private, administrative, locality, canal, residential, hospital, ham-let, aerodrome, station, trunk, commercial, construction, neighbor-hood, common, bank, motorway_junction, suburb, kindergarten,river, bus_stop, bus_station, school, restaurant, unclassified, motor-way, hotel, secondary, peak, primary, water, public_building, bicy-cle_parking, service, house, stadium, railway

filter only GLn entities identified as locations. For example, in thethird example shown in Table 4, if "Vasant Vihar" is identified as alocation then so are the "Vasant" and "Vihar". Therefore, we take thelongest subsequence of the strings identified as locations and select"Vasant Vihar" as the location.

In order to identify the type of a location (region or landmark),we use OpenStreetMap (OSM) API10. OSM API takes geographicallocations GL as an input and identifies their geocode (latitude andlongitude). Further, it also identifies the type of place (such as road,building, school, highway etc) based on the using the tags associatedwith their basic form. Each tag represents the geographic feature ofthe place. We group these features into two categories: region andlandmark. We define a region as the name of a village, town, city,district and tertiary. Whereas, we define a place to be a landmark

10https://wiki.openstreetmap.org/wiki/API_v0.6

if it tells the pinpoint location, nearby area or locality. For exam-ple, geographic features such as school, church, hospital or buildingare labeled as landmarks. Table 5 shows the name of all amenitieslabeled as landmarks.

5.2 Problem or Issue IdentificationAs discussed above the public agencies accounts on Twitter are

open and therefore, any one tag them in their complaints. However,during our inspection on Twitter, we observe that despite beinga complaint report not all tweets posted on @MORTHIndia and@nitin_gadkari are related to killer roads. Further, in order to iden-tify the complaint tweet, it is important to detect the informationabout topic or cause of the issue. For example, in a tweet "Big pot-holes on Delhi roads. It’s risky to drive at nights", identification ofterm "pothole" is important to know that the tweet belongs to a killerroad category. Tweets are user generated data that contains free-formtext and high likely to have different terminology for reporting simi-lar complaints. For example, "streetlights on highway not working","dim streetlights on highway", "highway is full of dark", "there isa complete blackout on highway" are reporting the same complaintusing different terminology.

In our proposed method, we address the challenge of keywordbased flagging methods for identifying the issues and problems re-ported in tweets. Figure 3 shows the high-level research frameworkfor topic identification. We apply CORE NLP parser on the tweetsidentified as unknown after executing AISP classification (ED2).We perform POS Tagging on each tweet and identify the nouns,

6

Page 7: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

Table 6: Key-terms Related to Various Issues Reported in KillerRoad Complaints

Related Terms[C1] street light, traffic light, drainage[C2] pothole, street light, traffic light, hawkers, breaker, sign

board, construction, intersection, diversion, animal, turn,crash, barrier

[C3] road, highway, traffic, flyover, underpass, bypass, chowk,expressway

[C4] barrier, repair, sign board, pothole, parking, police

Table 7: A Snapshot of ConceptNet Distance Between TermsPresent in Tweets and Issues Related to Killer Roads

highway traffic flyover animal constructioncongestion 0.1 0.34 0.03 0 0.01street_light 0.85 0.50 0.28 0.12 0.02intersection 0.73 0.4 0.31 0.03 0.04hawker 0.26 0.53 0.35 0.12 0

adjective, and adverbs in our posts. We filter the terms that havealready been tagged as places or persons during geographical loca-tion identification. In order to address the limitations of keywordbased flagging methods, we compute the semantic similarity be-tween the terms presented in a tweet and the terms related to road,transport, and highway (RTH) complaints. Research shows that alexical database can be used to compute semantic similarity betweentwo terms [3]. We use Conceptnet commonsense knowledgebase11

as a lexical resource and identify the conceptual similarity betweenthe terms presented in the tweets and RTH complaints. ConceptNetis an ontology-based semantic network in which each node repre-sents a concept and each edge represents the relation between twoconcepts. ConceptNet is an open source knowledgebase consistingof concepts extracted from common and informal sentences anddaily basic knowledge [8]. It not only identifies the shortest pathbetween two concepts but also identifies the common-sense associ-ations that users make among the concepts. Therefore, due to thefree-form structure of tweets, we select ConceptNet over WordNet12

and further, ConceptNet is an extension over WordNet and DBPe-dia13 lexical resources.

We conduct a manual inspection on Twitter handle of @MOR-THIndia and @nitin_gadkari and identify several issues reported bypublic citizens about bad road conditions. Based on our observation,we define four categories of killer road complaints: [C1] Dysfunc-tional facilities, [C2] Risky and Hazardous (accident prone), [C3]Poor Conditions and [C4] Indiscipline and Carelessness. For eachcategory, we define certain key-terms and compute their conceptualsemantic similarity (using ConceptNet) with the terms present inthe tweet. Table 6 shows the list of all key-terms defined in theabove-mentioned categories. Table 6 also reveals that there are sev-eral terms such as street light, signboard, and pothole that occur inmore than one category. Therefore, we create a lexicon of all distinct

11http://conceptnet5.media.mit.edu12https://wordnet.princeton.edu13http://wiki.dbpedia.org/projects/dbpedia-lookup

words and compare them with the terms present in the tweet andidentified as a Noun. The high confidence score of similarity com-putation shows that the tweet is high likely to be in the associatedcategory. We use association method of ConceptNet to access theconcepts and compute their semantic similarity. We use Concept-Net5.0 API and find the association between two given conceptsusing [assoc/c/en/concept1?concept2/] request while we filter resultscomputed for only English-language concepts using [filter=/c/en/].Given a tweet t = {t1, t2, t2, ...., tn}, a set of killer road complaintsrelated key-terms K = {C1 ∪C2 ∪ ....∪C j} where 1 ≤ j ≤ 4 and anetwork of knowledgebase Nconcept . We find a ti ∈ t where POSti ∈{NN,NNP,NNS,NNPS}, ti < GL and ti < P (refer to Section 5.1).Then, ∀Kp ∈ K | K = {K1,K2, ...Km}, we compute the confidencescore with each ti as CStiKp =Conceptual_Similarityti,Kp,Nconcept .Therefore, for each ti ∈ t and Kp ∈ K, we get a two-dimensionalvector Scorem,n matrix.

CStK=

t1 t2 . . ti . . tnK1 0.4 0.5 . . 0.23 . . 0.50K2 0.3 0.02 . . 0.7 . . 0.4. . . . . . . . .. . . . . . . . .

Kp 0.24 0.25 . . 0.3 . . 0.7. . . . . . . . .. . . . . . . . .

Km 0.7 0.8 . . 0.6 . . 0.9

If, ∃Kp ∈ K : CStiKp ≥ th then the topic of a complaint is saidto be true for tweet t. Whereas ∀Kp ∈ K : CStiKp ≤ th, we convertthe adjective form of noun words if available. If ∀ti ∈ t and ∀Kp ∈K, the confidence score between concepts CStiKp ≤ th, the topicof a complaint is said to be false for t. Table 7 shows concreteexamples of conceptual similarity score (confidence) score computedbetween various concepts- noun terms present in the tweets and topic-related keywords. For each ti, we select all Kp such that CStiKp ≥ thand assign various categories (Dysfunctional Facilities, Risk andHazardous, Poor Condition and Indiscipline & Carelessness) to thetweet. ∀ j, C j=TRUE for tweet t, if Kp ∈C j | 1 ≤ j ≤ 4. We, howeverfind several such examples, where the problem identification ischallenging not only for the machine but also for human annotation.Such reports contains humor, ambiguity and sarcasm and increasethe number of false alarms in problem identification. For example,"Nitin Gadkari Come to Gurugram! Enjoy the fun of Venice forfree! Great offer by Haryana Government!", "Nitin Gadkari NH17is National Horror neglected by Govt. After Govt", and "But thishighway is like a black spot on moon".

6 CLASSIFICATIONAs discussed above, Twitter users can direct mention a public

agency account in their tweets while reporting a complaint. However,due to no restriction to the direct mention, users tag these accountsin many non-complaint and AISP tweets. Based on our inspection ofavailable information in the tweets posted on @MORTHIndia and@nitin_gadkari, we define three classes of tweets: Useful tweets,Nearly-Useful tweets, and Irrelevant tweets. We use Rule-basedclassifier trained on the features extracted in Section 5 i.e. 1) problem

7

Page 8: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

Table 8: Concrete Examples of Reports Identified as IrrelevantTweets (IRT), Useful Tweets (UT) and Nearly-Useful Tweets (N-UT) for Addressing and Resolving the Complaint

Type TweetsIRT can we take some action on such wastage of public time

n fuel as it’s wastage of natural resourcesUT Sir your attention please. these are craters on Mum-Goa

Highway near Chiplun. Being hopefulN-UT people are struggling with pothole roads in national capi-

tal Delhi, do something about that

or issue reported in the tweet, 2) city or town or region mentioned inthe complaint and 3) landmark or pinpoint location of the complaint.

(1) Irrelevant Tweets (IRT): We define a tweet as irrelevant ifthe post is not about the poor conditions and irregularitiesof roads or highways causing life risks, discomfort, hazardor poor experience to the citizens. Table 8 shows examplesof such tweets posted on official accounts of Governmentauthorities of Road and Transport department. Table 8 revealsthat the tweet labeled as irrelevant is either off-topic or theauthors discuss the problems related to road and transport;however the complaint is not about the poor conditions ofroads or faulty and dysfunctional facilities.

(2) Useful Tweets (UT): Useful tweets Ut are the posts whichare a clear indicator of complaints and can be used to iden-tify the low-level details of the issue faced by the citizens.Given a tweet ti and a set of named entities X =< Tc,Tl ,Tp >,we define ti as a useful tweet- ti ∈Ut if ti ∈ N (dataset) andti = {X ,To | ∀x ∈ X : P

(x)

where P(x), φ}. While, Tc de-

notes the name of city or region, Tl denotes exact geograph-ical location or landmark, Tp denotes the problem or issuereported in the complaint and To denotes other words in thetweet. Table 8 shows examples of tweet consisting of detailedinformation about the city, landmark location, and the prob-lem reported in the tweet. For each ti ∈Ut , we refer them asExperimental Dataset 3 i.e. ED3.

(3) Nearly-Useful Tweets (N-UT): Nearly-Useful tweets are thetweets posted for complaining a report but containing incom-plete or insufficient information about the issue. For example,missing the exact location of the problem, ambiguity in report-ing the issue, defining the problem on an abstract level andlacking the details. Given a tweet ti, we define it as a nearly-useful tweet ti ∈ NUt if ti ∈ N and ti = {X ,To | ∃x ∈ X : P

(x)

where P(x)= φ}. Table 8 shows examples of tweets posted

as complaint reports but providing insufficient information.For simplicity of tweets, we removed the @username men-tions from the tweets and used the corrected and enrichedform of tweets (after applying micropost-enricment algo-rithm). Table 8 reveals that in a N-UT complaint, the au-thor has mentioned the problem of potholes in the city ofNew Delhi while the exact location of potholes is missingfrom the complaint. For each ti ∈ NUt , we refer them asExperimental Dataset 4 i.e. ED4.

6.1 Enrichment of Incomplete ReportsIn this phase of proposed approach, we convert nearly useful

tweets into useful tweets by extracting possible information aboutmissing entities. As discussed above, a complete complaint reportis comprised of three major components: city Tc (or landmark orregion), pinpoint location Tl , and the problem Tp reported in thetweet T . However, based on the type of missing information, it isnot always possible to convert an incomplete or nearly-useful tweetto a useful and complete report. For example, in the absence of issueexpected to be reported in the complaint, it is hard to identify theproblem component since the complaint is subjective and can beabout any topic or it can be completely irrelevant as well. Therefore,we discard the tweets identified as nearly-useful tweets with no is-sue mentioned in the reports. We apply the geographical locationhierarchy model (bottom-up) and use graph backtracking methodto identify the region or locality for a given pinpoint location in thetweet. For a given pinpoint location Tl , we use OpenStreetMap APIand extract the detailed information associated with a geographicallocation. For example, for IIIT-Delhi, we are able to extract Govind-Puri Metro Station as landmark, Phase 3 as locality in Okhla regionwhich is located in Delhi city. We use OpenStreetMap API to enrichthe region and city information Tc for each tweet consisting of onlypinpoint locations Tl .

Since it is possible for two different locations to have the samename, we further use Twitter user location to verify the identifiedcity information. We extract the location of all distinct users avail-able in our experimental datasets 4 (ED4- Nearly Useful Tweets).If the city or state information mentioned on their Twitter profilematches with the location identified using OpenStreetMap API thenwe show the 100% accuracy of the location component enrichmentotherwise we proceed with the enriched results without claimingthe 100% accuracy in enrichment. For example, in a tweet "Majorblunder by DMRC at Huda City Center Metro Station. Entry to Exitplatform Clashing", "Huda City Center" is detected as a locationentity. We use Using OpenStreetMap, we get the location or cityname as Gurgaon which is also mentioned as the users’ location inhis Twitter profile. Furthermore, for a given region or city, it is notpossible to trace in the geographical location hierarchy model andidentify the pinpoint location of the problem. For example, for thetweet "Mumbai can be termed as the pothole capital of India. Epicmess of @SwachhBharatGov.", the reported problem aims at thegeneralized picture of Mumbai roads and therefore it is infeasibleto identify the exact location of the problem. While using Open-StreetMap, only state information can be extracted as Mumbai is acity and the capital of Maharashtra state which does not provide anyinformation about the problem’s location.

7 EMPIRICAL ANALYSIS AND EVALUATIONRESULTS

In this Section, we discuss the empirical analysis performed onthe tweets, acquired experimental results and the characterizationstudy performed on the complaint reports. We present the accu-racy results of rule-based classifier and also discuss the influenceof user-generated content on the overall accuracy of the proposedapproach. We evaluate the accuracy of our classifier by comparingthe observed results against actual labeled class. We conduct our

8

Page 9: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

Table 9: Confusion Matrix Results for the Rule-based Clas-sifier for Identifying Irrelevant (IRT), Nearly-Useful Tweets(N-UT) and Useful Tweets (UT)

PredictedN-UT IRT UT Total

ActualN-UT 1088 131 59 1278

IRT 376 569 17 962UT 254 34 94 382

Total 1718 734 170 2,622

Table 10: Distribution of Complaint Reports Classified into4 Different Categories. Further, the Table Reports the Num-ber of Reports Classified into More than 1 Categories.

C1 C2 C3 C4Dysfunctional Facilities (C1) 402 345 282 44

Risky and Hazardous Roads (C2) - 1142 933 145Poor Condition Amenities (C3) - - 1042 92

Indiscipline and Carelessness (C4) - - - 252

experiments on 3,302 random sampled tweets collected for @MOR-THIndia and @nitin_gadkari and report the accuracy of AISP andcomplaint tweets classifiers. Proposed AISP classifier identifies atotal of 20.5% (680 out of 3302) tweets as certainly non-complaint(AISP) reports. Among 680 AISP tweets, 417 tweets are identifiedas news or information sharing tweets. While, 166 and 97 tweets areidentified as appreciation and promotional tweets respectively. Basedon our AISP experimental results, we perform rule-based classifieron the remaining 2,622 tweets and identify useful, nearly-useful andirrelevant complaint reports.

Table 9 reveals that a very small percentage of tweets (6.4%- 170tweets out of 2622 reports) are identified as complete reports thatcontains all three important components of a killer road complaint.Whereas, the largest chunk of reported tweets is classified as incom-plete or nearly-useful tweets (1718 reports out of 2622 tweets). Ourexperimental results reveal that further only a very small percentageof tweets are convertible (N-UT-C) into complete or useful tweets(50 tweets out of 1718 nearly-useful tweets) while 97% (1668 out of1718) of nearly-useful tweets have either landmark or concrete prob-lem component missing from the tweets and hence not-convertible.Table 11 shows the example of tweets present in our dataset andclassified as nearly-useful tweets which further cannot be enrichedor converted (N-UT-N). Table 9 reveals that due to the large percent-age of reports with missing or incomplete information that is notpossible to enrich, it is technically challenging to identify each andevery complaint tweet efficiently.

In order to measure the performance of our classifier, we usestandard metrics of Information Retrieval and compute the overallaccuracy of the proposed approach. Table 9 shows the confusionmatrix of the proposed classifier. Since the proposed approach isa multi-class classifier, we compute the performance of each class.Based on the results acquired by our rule-based classifier, we clas-sify bad road related complaints with an overall accuracy of 67%. Inaddition to measuring the accuracy of our classifier, we also measurethe overall recall value of the classification. Based on our results andas shown in Table 9, we achieve a recall of 65%. As discussed inSection 5.2, the complaints reported to public agencies’ accounts areuser-generated content and lacks a standard format or terminologyfor complaining a report. Further, the excessive use of metaphorand sarcasm while reporting a complaint generates false alarms andimpacts the overall accuracy of the classification.

7.1 Characterization of Complaint ReportsWe perform a feature-based characterization on the tweets classi-

fied into 4 major categories of killer road complaints (Section 5.2):

Table 11: Examples of Tweets Present in Our Experimen-tal Dataset and Classified into Non-Convertible Nearly-UsefulTweets. Table also Illustrates the Available and Missing Compo-nent in the Tweets.

Tweet Available MissingMumbai contrast. Is this really thesame city. Nitin Gadkari.

<Tc> <Tl , Tp>

Nitin Gadkari even police vans donot follow traffic rules

<Tp> <Tl , Tc>

[C1] Dysfunctional facilities, [C2] Risky and Hazardous (accidentprone), [C3] Poor Conditions and [C4] Indiscipline and Carelessness.Based on our results and as illustrated in Figure 3, we find that thereare several complaints that belong to more than one category. We,however, report the results only for the complaints that are classi-fied into one or two categories. Table 10 shows the distribution ofcomplaint reports (including nearly-useful tweets), the maximumnumber of complaints (1142 and 1042) are reporting about the riskyand hazardous roads which are accident prone and poor conditionsof amenities and roads respectively. While, the minimum number ofcomplaints (252) are reported about the risk caused by indisciplineor carelessness of driving. Table 10 also shows that further a largepercentage of complaints (933) reported on the risky and accidentprone roads are due to poor condition of the roads.

We also observe that the complaints related to risky and accidentprone roads contains the terms and issues like barricade, cracker,expressway, highway, chowk, zebra, street, turn, impediment, flyover,construction and builders. Whereas, the terms like pothole, track,pedestrian, road, bypass, intersection, truck, traffic, passage, bus, un-derpass are the commonly used words in the tweets reporting aboutpoor conditions of roads. Further, the words like barrier, stop, res-cue, arrest, constable, policy, police, tunnel are the common wordsused in the reports complaining about the carelessness of drivers orauthorities.

In order to identify the source of maximum complaints on killerroads, we perform a characterization on the location feature of thetweets. Figure 4 shows the distribution of geographical locationsidentified in the complaints reports (complete and nearly-usefultweets). Figure 4(a) reveals that citizens post complaints to @MOR-THIndia and @nitin_gadkari from all regions of India. Figure 4(b)shows the city-level distribution of complaint reports posted from allover the India. Figure 4(b) reveals that despite receiving complaints

9

Page 10: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

(a) Distinct States (b) Distinct Places

Figure 4: Illustrating the Distribution of Distinct Geographical Locations (Places and States) Identified in the Complaint Reports inour Experimental Dataset

from every state of India, there are some states from where the maxi-mum numbers are reported. The map presented in Figure 4(b) showsthat the cities of Mumbai, Delhi, Haryana, Bihar and Uttar Pradeshreport complaints more relative to the cities of states like Karnatka,Telangana, Odisha or North East region of the country.

8 CONCLUSIONS AND FUTURE WORKIn this paper, we present our study on identification of complaints

reported in road irregularities and bad road conditions. We addressthe challenge of noisy and user-generated data by performing syn-tactic enrichment on raw-tweets. We also publish the first databaseof complaint reports on killer roads posted on official Twitter han-dle of Indian public agencies. We propose to use the application ofConceptNet lexical knowledgebase, named entity recognizers andgeocoding location APIs to extract important components of a killerroad complaint; such as problem reported in the complaint, land-mark or pinpoint location, city or location information. Based on theavailable components in the tweets, we classify them into three cate-gories: useful tweets (UT), nearly-useful tweets (N-UT), irrelevanttweets (IRT). We further propose a mechanism to enrich the N-UTsand convert them into UTs. Our results shows that the proposedapproach classifies these complaint reports with an accuracy of 67%and a recall of 65%. We perform a characterization on complaint re-ports and our results reveal that maximum number of complaints arereported about the risky and accident prone roads while most of themare due to the poor condition of amenities. Further, the complaints

are reported from all over the India while maximum complaints arereported from Maharashtra, New Delhi, Uttar Pradesh, and Biharstates.

Future work includes addressing the limitations of present studyby improving the accuracy of location identification. Further, thereare several complaints which contains humor, sarcasm and ambiguity.Such tweets does not provide information about the issue reported inthe complaint. Our future work includes the use of metaphor analysisfor extracting information from such ambiguous posts.

REFERENCES[1] Swati Agarwal. 2017. Open source social media intelligence for enabling govern-

ment applications. ACM SIGWEB Newsletter Summer (2017), 3.[2] Swati Agarwal, Nitish Mittal, and Ashish Sureka. 2017. Syntactic Enhancement

of Killer Road Complaint Tweets Posted on Twitter, Mendeley Data, v1, DOI:10.17632/dm6s252524.1. (2017).

[3] Swati Agarwal and Ashish Sureka. 2015. Using common-sense knowledge-base for detecting word obfuscation in adversarial communication. In 2015 7thInternational Conference on Communication Systems and Networks (COMSNETS).IEEE, 1–6.

[4] Eleonora et al. D’Andrea. 2015. Real-time detection of traffic from twitter streamanalysis. IEEE Transactions on Intelligent Transportation Systems 16, 4 (2015),2269–2283.

[5] Kaiqun Fu, Rakesh Nune, and Jason X Tao. 2015. Social media data analysisfor traffic incident detection and management. In Transportation Research Board94th Annual Meeting.

[6] Yiming Gu, Zhen Sean Qian, and Feng Chen. 2016. From Twitter to detector: Real-time traffic incident detection using social media data. Transportation ResearchPart C: Emerging Technologies 67 (2016).

[7] Fitrie et al. Handayani. 2016. THE USE OF MEME AS A REPRESENTATIONOF PUBLIC OPINION IN SOCIAL MEDIA: A CASE STUDY OF MEME

10

Page 11: Potholes and Bad Road Conditions- Mining Twitter to ... · Transport and Highways3 (MORTH), Government of India, on an average 400 road deaths take place every day in India4. According

ABOUT BEKASI IN PATH AND TWITTER. HUMANIORA 7, 3 (2016).[8] Catherine Havasi, Robert Speer, and Jason Alonso. 2007. ConceptNet 3: a flexible,

multilingual semantic network for common sense knowledge. In Recent advancesin natural language processing. Citeseer, 27–29.

[9] Avinash Kumar, Miao Jiang, and Yi Fang. 2014. Where not to go?: detectingroad hazards using twitter. In Proceedings of the 37th international ACM SIGIRconference on Research & development in information retrieval. ACM, 1223–1226.

[10] Eric Mai and Rob Hranac. 2013. Twitter interactions as a data source for trans-portation incidents. In Proc. Transportation Research Board 92nd Ann. Meeting.

[11] Nitish Mittal, Swati Agarwal, and Ashish Sureka. 2016. Got a Complaint?-Keep Calm and Tweet It!. In Advanced Data Mining and Applications: 12thInternational Conference, ADMA 2016, Gold Coast, QLD, Australia, December12-15, 2016, Proceedings 12. Springer, 619–635.

[12] Panos Panagiotopoulos, Julie Barnett, Alinaghi Ziaee Bigdeli, and Steven Sams.2016. Social media in emergency management: Twitter as a tool for communicat-ing risks to the public. Technological Forecasting and Social Change 111 (2016),86–96.

[13] Axel Schulz, Petar Ristoski, and Heiko Paulheim. 2013. I see a car crash: Real-time detection of small scale incidents in microblogs. In Extended Semantic WebConference. Springer, 22–33.

[14] Napong et al. Wanichayapong. 2011. Social-based traffic information extrac-tion and classification. In ITS Telecommunications (ITST), 11th InternationalConference on. IEEE.

11