user review sites as a resource for large scale sociolinguistic studies
TRANSCRIPT
USER REVIEW SITES AS A RESOURCE FOR LARGE-SCALE SOCIOLINGUISTIC STUDIES
By Ashutosh Bhargave
Anders Johannsen, Dirk Hovy, Anders Søgaard
University of Copenhagen
Sociolinguistic studies
Problems:
• Traditional approach
• Social media data
Remedy:
• The paper aims to remedy both problems by exploring a large new data source: international review websites with user profiles.
Relation between language and extra-linguistic variables
DATA FORMAT:
• The Trustpilot Corpus consists of user reviews from the Trustpilot website.
• Users need to register with a username in order to leave a review.
• There are no mandatory fields other than the name.
• Unique identifiers are assigned to both users and companies and used to link up reviews.
• The fields of most interest are age, gender, and location, in combination with the written reviews.
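The linking described above could be modelled roughly as follows. This is a minimal sketch, not the corpus's actual schema: the class names, field names, and the `reviews_by_user` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class User:
    # Hypothetical fields; only the name is mandatory on the site.
    user_id: str
    name: str
    age: Optional[int] = None
    gender: Optional[str] = None
    location: Optional[str] = None


@dataclass
class Review:
    # The two identifiers link a review to its author and its target company.
    user_id: str
    company_id: str
    text: str


def reviews_by_user(reviews: Iterable[Review]) -> Dict[str, List[Review]]:
    """Group reviews under the identifier of the user who wrote them."""
    grouped: Dict[str, List[Review]] = {}
    for review in reviews:
        grouped.setdefault(review.user_id, []).append(review)
    return grouped
```

Keeping the optional profile fields as `None` when absent makes it easy to filter for users who supplied age, gender, or location.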
DATA AUGMENTATION
The retrieved data set was augmented in two ways:
1. gender information based on first names, and
2. geotagging information (latitude and longitude).
Problems:
1. no gender information
2. “canonical” town
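The first augmentation step, inferring gender from first names, might look something like the sketch below. The name lists are tiny and hypothetical, and the rule of leaving ambiguous or unknown names unlabelled is an assumption, not a detail taken from the slides.

```python
from typing import Optional

# Hypothetical name lists for illustration only; a real study would need
# name statistics for each country in the corpus.
FEMALE_NAMES = {"anna", "maria", "sofie"}
MALE_NAMES = {"anders", "dirk", "peter"}


def infer_gender(display_name: str) -> Optional[str]:
    """Return 'F', 'M', or None when the first name is unknown or ambiguous."""
    tokens = display_name.strip().lower().split()
    if not tokens:
        return None
    first = tokens[0]
    in_female = first in FEMALE_NAMES
    in_male = first in MALE_NAMES
    if in_female == in_male:  # in neither list, or in both
        return None
    return "F" if in_female else "M"
```

Returning `None` rather than guessing keeps uncertain users out of any gender-based comparison.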
REPRESENTATIVENESS
• Restricted to the age range from 16 to 80.
• The median age in the data is typically close to the country's median value.
• There are more male than female users.
• The average number of reviews per user is around 4.
Emoticons, age, and gender
• Eyes ( : ; ), nose ( - or none), mouth ( ( , ) , [ , * etc.)
• Women use emoticons almost twice as often as men do, at all ages.
• The use of a nose is highly anti-correlated with age.
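The eyes, nose, and mouth decomposition above maps naturally onto a regular expression. The following is a rough sketch: the mouth set contains only the four characters named on the slide (the "etc." is left unfilled), and the function name `emoticon_stats` is hypothetical.

```python
import re

# Eyes (":" or ";"), an optional nose ("-"), then a mouth.
# The mouth set is limited to the characters listed on the slide.
EMOTICON = re.compile(r"[:;](-?)[()\[*]")


def emoticon_stats(text):
    """Return (number of emoticons, number of those written with a nose)."""
    total = 0
    with_nose = 0
    for match in EMOTICON.finditer(text):
        total += 1
        if match.group(1):  # the nose group is non-empty
            with_nose += 1
    return total, with_nose
```

Aggregating these two counts per age group and gender would give the kind of comparison the slide reports.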
Ratings, categories, gender, and age
• Men tend to give slightly more negative ratings than women.
• People in the younger group are more likely to use negative ratings than people in the older group.
DENMARK
• Missing distinction between the reflexive possessive pronouns and the non-reflexives.
• Record the frequency of sin/sit (his/her own) and the joint frequency of all possessive pronouns (his).
• Then compute the ratio of the former among all pronouns.
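The ratio described above can be sketched in a few lines. The word lists are assumptions: the slide names only sin/sit, and the non-reflexive forms hans and hendes are added here purely for illustration. The helper name `reflexive_ratio` is likewise hypothetical.

```python
import re

# Assumed word lists; only sin/sit appear on the slide.
REFLEXIVE = {"sin", "sit"}
NON_REFLEXIVE = {"hans", "hendes"}


def reflexive_ratio(texts):
    """Share of reflexive forms among all counted possessive pronouns.

    Returns None when no possessive pronoun occurs at all.
    """
    reflexive = 0
    non_reflexive = 0
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            if token in REFLEXIVE:
                reflexive += 1
            elif token in NON_REFLEXIVE:
                non_reflexive += 1
    total = reflexive + non_reflexive
    return reflexive / total if total else None
```

Computing this ratio separately for each region would show where the reflexive forms are used less often.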
Swear words across location, gender, and age:
• As people grow older, they tend to use more conservative language.
• Women use the stronger versions of these words less than men do.
GERMAN
• Replacement of ß with ss: dass/daß, “that”, and the modal müssen/müßen, “must”.
• Older speakers retain the traditional spelling they acquired in their youth to a much greater extent.
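One way to measure how far each age group retains the traditional spelling is to compare counts of daß and dass. This is a minimal sketch under stated assumptions: the input format of (age group, text) pairs and the function name `traditional_share_by_group` are hypothetical, and only the dass/daß pair is counted.

```python
import re
from collections import defaultdict


def traditional_share_by_group(reviews):
    """reviews: iterable of (age_group, text) pairs.

    Returns, per group, the share of "daß" among all "daß"/"dass" tokens.
    Groups in which neither spelling occurs are left out.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [traditional, reformed]
    for group, text in reviews:
        for token in re.findall(r"\w+", text.lower()):
            if token == "daß":
                counts[group][0] += 1
            elif token == "dass":
                counts[group][1] += 1
    return {
        group: trad / (trad + ref)
        for group, (trad, ref) in counts.items()
        if trad + ref
    }
```

A higher share in the older groups would match the pattern the slide describes.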
CONCLUSION
• Traditional sociolinguistic studies often lack the statistical power to draw valid conclusions, and big-data approaches to language studies mostly lack the extra-linguistic information that would enable sociolinguistic studies.
• User review sites offer a solution to this dilemma.