USER REVIEW SITES AS A RESOURCE FOR LARGE-SCALE SOCIOLINGUISTIC STUDIES

By Ashutosh Bhargave.

Anders Johannsen, Dirk Hovy, Anders Søgaard
University of Copenhagen

Post on 11-Apr-2017



OUTLINE:
• Introduction
• Data Format
• Data Augmentation
• Representativeness
• Pilot Studies
• Conclusion

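Sociolinguistic studies

Sociolinguistic studies examine the relation between language and extra-linguistic variables.

Problems:
• Traditional approach: studies often lack the statistical power to draw valid conclusions.
• Social media data: mostly lacks the extra-linguistic information that sociolinguistic studies need.

Remedy:
• The paper aims to remedy both problems by exploring a large new data source: international review websites with user profiles.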

DATA FORMAT:

• The Trustpilot Corpus consists of user reviews from the Trustpilot website.

• Users need to register with a username in order to leave a review.
• There are no mandatory profile fields other than the name.
• Unique identifiers are assigned to both users and companies, and those are used to link up reviews.
• The main interest is in age, gender, and location in combination with the written reviews.
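The linking of reviews through user and company identifiers can be sketched as below. This is only an illustration of the idea: the class names, field names, and sample records are hypothetical and are not the actual schema of the Trustpilot Corpus.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    # Hypothetical profile record; only the name is mandatory on the site.
    user_id: int
    name: str
    age: Optional[int] = None
    gender: Optional[str] = None
    location: Optional[str] = None


@dataclass
class Review:
    # Hypothetical review record, linked to a user and a company by id.
    user_id: int
    company_id: int
    rating: int
    text: str


def reviews_by_user(reviews):
    """Group reviews under the id of the user who wrote them."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review.user_id].append(review)
    return dict(grouped)


# Made-up sample data for illustration only.
users = {
    1: User(1, "Anna", age=34, gender="F", location="Aarhus"),
    2: User(2, "Lars"),
}
reviews = [
    Review(1, 10, 5, "Great service"),
    Review(1, 11, 2, "Slow delivery"),
    Review(2, 10, 4, "Fine"),
]
linked = reviews_by_user(reviews)
```

With the reviews grouped this way, each text can be paired with whatever age, gender, and location the matching profile provides.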

DATA AUGMENTATION
The retrieved data set was augmented in two ways:
1. gender information based on first names, and
2. geotagging information (latitude & longitude).

Problems addressed:
1. Profiles with no gender information.
2. Resolving each location to a "canonical" town.
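One possible way to add gender information from first names is sketched below. It assumes that names of users who did state their gender serve as evidence, and that a name is only used when it is common enough and almost always goes with one gender. The thresholds `min_count` and `min_share`, the function names, and the sample names are illustrative assumptions, not values taken from the paper.

```python
from collections import Counter, defaultdict


def build_name_table(labeled_users):
    """Count how often each first name occurs with each stated gender."""
    counts = defaultdict(Counter)
    for first_name, gender in labeled_users:
        counts[first_name.lower()][gender] += 1
    return counts


def infer_gender(first_name, table, min_count=3, min_share=0.95):
    """Return a gender only for frequent, nearly unambiguous names."""
    counts = table.get(first_name.lower())
    if not counts:
        return None  # name never seen with a stated gender
    gender, top = counts.most_common(1)[0]
    total = sum(counts.values())
    if total >= min_count and top / total >= min_share:
        return gender
    return None  # too rare or too ambiguous to decide


# Made-up (first name, stated gender) pairs for illustration only.
labeled = (
    [("Anna", "F")] * 5
    + [("Lars", "M")] * 4
    + [("Kim", "F")] * 2
    + [("Kim", "M")] * 2
)
table = build_name_table(labeled)
```

Here `infer_gender("anna", table)` yields `"F"`, while the ambiguous name `"Kim"` and any unseen name are left without a label rather than guessed.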

REPRESENTATIVENESS
• The data is restricted to the age range from 16 to 80.
• The median age in our data is typically close to the country's median value.
• There are more male than female users.
• The average number of reviews per user is around 4.

PILOT STUDIES
Discovering gender-specific words:

Emoticons, age, and gender
• Emoticon parts: eyes ( : ; ), nose ( - or none), mouth ( ( , ) , [ , * etc.)
• Women use emoticons almost twice as often as men do, for all ages.
• The use of a nose is highly anti-correlated with age.
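The eyes, nose, and mouth decomposition above can be sketched as a regular expression. The pattern covers only the characters listed on the slide plus the closing bracket, so it is a rough approximation rather than the emoticon inventory used in the study; the names `EMOTICON`, `find_emoticons`, and `nose_share` are made up for this example.

```python
import re

# eyes: ':' or ';'   nose: optional '-'   mouth: one of ( ) [ ] *
EMOTICON = re.compile(r"(?P<eyes>[:;])(?P<nose>-?)(?P<mouth>[()\[\]*])")


def find_emoticons(text):
    """Return (emoticon, has_nose) pairs found in the text."""
    return [
        (match.group(0), bool(match.group("nose")))
        for match in EMOTICON.finditer(text)
    ]


def nose_share(texts):
    """Share of emoticons written with a nose, or None if there are none."""
    found = [pair for text in texts for pair in find_emoticons(text)]
    if not found:
        return None
    return sum(has_nose for _, has_nose in found) / len(found)


sample = find_emoticons("Great shop :-) fast delivery ;)")
```

Computing `nose_share` separately for each age group or gender would give the kind of comparison reported on the slide. A pattern this simple also matches incidental punctuation such as a colon followed by a closing parenthesis, so real use would need filtering.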

Ratings, categories, gender, and age
• Men tend to give slightly more negative ratings than women.
• People in the younger group are more likely to give negative ratings than people in the older group.

DENMARK
• Missing distinction between the reflexive possessive pronouns and the non-reflexive ones.
• Record the frequency of sin/sit ("his/her own") and the joint frequency of all possessive pronouns ("his"), then compute the share of the former among all possessive pronouns.
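The ratio described above can be sketched as a simple token count. The reflexive forms sin/sit come from the slide; treating hans/hendes as the non-reflexive possessives is an assumption made for this example, and the function name and sample sentences are made up.

```python
import re

REFLEXIVE = {"sin", "sit"}          # forms named on the slide
NON_REFLEXIVE = {"hans", "hendes"}  # assumed non-reflexive counterparts


def reflexive_ratio(texts):
    """Share of reflexive forms among all counted possessive pronouns."""
    reflexive = 0
    total = 0
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            if token in REFLEXIVE:
                reflexive += 1
                total += 1
            elif token in NON_REFLEXIVE:
                total += 1
    # Avoid dividing by zero when no possessive pronoun occurs.
    return reflexive / total if total else None


# Made-up sentences for illustration only.
ratio = reflexive_ratio([
    "Han tog sin jakke",
    "Han tog hans jakke",
    "Hun fik sit køb",
])
```

Comparing this ratio across locations would show where the reflexive forms are used less.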

Swear words across location, gender, and age:
• As people grow older, they tend to use more conservative language.
• Women use the stronger swear words less than men do.

GERMAN
• Replacement of ß with ss: dass/daß, "that", and forms of the modal müssen, "must" (muss/muß).
• Older speakers retain the traditional spelling they acquired in their youth to a much greater extent.
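A comparison of spelling by age can be sketched by counting traditional and reformed forms per age group. The word lists are limited to the two pairs named above, and the two-group split with its `boundary` value, the function name, and the sample reviews are hypothetical choices for this example.

```python
import re
from collections import defaultdict

TRADITIONAL = {"daß", "muß"}
REFORMED = {"dass", "muss"}


def traditional_share_by_group(reviews, boundary=40):
    """Share of traditional spellings per age group.

    `reviews` is an iterable of (age, text) pairs; `boundary` is an
    arbitrary cut-off between the "younger" and "older" group.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [traditional, total]
    for age, text in reviews:
        group = "younger" if age < boundary else "older"
        for token in re.findall(r"\w+", text.lower()):
            if token in TRADITIONAL:
                counts[group][0] += 1
                counts[group][1] += 1
            elif token in REFORMED:
                counts[group][1] += 1
    return {g: trad / total for g, (trad, total) in counts.items() if total}


# Made-up reviews for illustration only.
shares = traditional_share_by_group([
    (25, "Ich muss sagen, dass es gut war"),
    (67, "Ich muß sagen, daß es gut war"),
    (70, "Schade, dass es so lange dauerte"),
])
```

A higher share in the older group would correspond to the retention of the traditional spelling described on the slide.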

CONCLUSION
• Traditional sociolinguistic studies often lack the statistical power to draw valid conclusions, and big-data approaches to language studies mostly lack the extra-linguistic information that would enable sociolinguistic studies.
• User review sites are a solution to this dilemma.

QUESTIONS?

THANK YOU