Crowdsourcing
Ling 240
What is crowdfunding?
Crowdsourcing: definition
“the practice of obtaining information or services by soliciting input from a large number of [non-expert] people, typically via the internet” (OED)
Examples:
• Wikipedia
• Google Translate
• FamilySearch Indexing
COCA's registers based on publication type
Crowdsourcing
• What are the benefits of collecting data through crowdsourcing?
• What are the limitations/weaknesses?
• What can be done to ensure that crowdsourcing workers are giving quality data?
Crowdsourcing in linguistics
• Wilhelm Kaeding (1897)
  • Thousands of non-experts helped compile and analyze an 11-million-word corpus of German
• Oxford English Dictionary (1858–1928)
  • Hundreds of non-expert readers submitted 6 million quotation slips
• Perceptual dialectology
  • Dialect perceptions elicited from non-experts
Mechanical Turk (Amazon)
• Strengths
  • Inexpensive
  • Fast
  • Quality control
  • Access to thousands of people
• Growing body of research strongly supports the quality of MTurk data
  • E.g., Buhrmester et al., 2011; Kittur et al., 2008; Suri & Watts, 2011; Urbano et al., 2010
Case study: Register classification
• Traditional ‘user’-based approach
  • ‘Expert’ classifies texts into registers by simply sampling from the publication type of interest
• Limitations
  • ‘Publication type’ is not a meaningful criterion for web documents
  • Experts can’t agree on register category for internet texts
Corpus
• Extracted from the Corpus of Global Web-based English (GloWbE), constructed by Mark Davies
• (Near) random sampling methods used to build the corpus
  • Google searches of highly frequent English 3-grams (e.g., is not the, and from the) used to identify URLs
  • 800-1000 links for each n-gram (i.e., 80-100 Google results pages)
• Davies randomly extracted c. 49,300 URLs from GloWbE
  • Only web pages from USA, UK, Canada, Aus., and NZ
  • Documents < 75 words were excluded
  • Non-textual material was removed from all web pages (HTML scrubbing and boilerplate removal) using JusText
• 1,445 URLs were excluded from subsequent analysis because they consisted mostly of photos or graphics.
• Final corpus for the study: 48,555 web documents.
People were asked to determine the mode of each passage, then its participants, purpose, etc. This led to 7 sub-registers.
Crowdsourcing end-user data: Classification
• Developed a computer-adaptive survey for register classification
• Tested the tool through 10 rounds of piloting, resulting in numerous revisions
• Recruited 908 raters through Mechanical Turk
• 6 responses × 4 raters × 49,300 texts ≈ 1.2 million individual ratings
Agreement results for the general register classification of 48,147 web documents (Fleiss’ Kappa = .47, moderate agreement)
• 69% of documents achieved majority agreement
• Additional 11.8% are potential 2-way hybrids

Agreement pattern   Documents   Percent
4 agree                17,511     36.4%
3 agree                15,684     32.6%
2-2 split               5,682     11.8%
2-1-1 split             8,515     17.7%
No agreement              755      1.6%
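
For reference, Fleiss' kappa can be computed from a table that records, for each document, how many raters chose each category. The following is a minimal Python sketch of the standard formula, assuming the same number of raters for every document. The function name and the toy counts are made up for illustration and are not the study's data or code.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts: one row per document; each row lists, per category,
    how many raters chose that category. Every row must sum to
    the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every document needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cats)]

    # Observed agreement: proportion of agreeing rater pairs per document.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance, given the category shares.
    p_expected = sum(p * p for p in p_cat)

    # Note: undefined (division by zero) if all ratings are in one category.
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 3 documents, 4 raters, 3 categories (invented counts).
toy_counts = [
    [4, 0, 0],  # all four raters agree
    [2, 2, 0],  # 2-2 split
    [1, 1, 2],  # 2-1-1 split
]
print(round(fleiss_kappa(toy_counts), 3))
```

A value of 1 means perfect agreement, and values near 0 mean agreement no better than chance.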
Frequencies of general register categories (i.e., documents where 3 or 4 raters were in agreement)
Systematic patterns of disagreement
• 28 different 2-2 combinations are possible in theory
• But, only 7 of those combinations occurred > 100 times in our corpus of 48,000 documents
• Because these are widely attested user-based patterns, we are able to interpret disagreement as a special pattern of agreement
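
The counting behind these points can be sketched in Python. The figure of 28 equals the number of unordered pairs that can be drawn from eight categories, so the sketch assumes eight general register categories. That number is inferred from the arithmetic and is not stated on the slide. The labels R1 to R8 and the toy ratings are placeholders.

```python
from collections import Counter
from itertools import combinations

# Placeholder labels. Eight categories is an assumption inferred from
# C(8, 2) = 28; the slide does not list the categories here.
CATEGORIES = [f"R{i}" for i in range(1, 9)]


def possible_hybrids(categories):
    """All unordered pairs of categories, i.e. the possible 2-2 hybrids."""
    return list(combinations(sorted(categories), 2))


def agreement_pattern(labels):
    """Name the agreement pattern for the four ratings of one document."""
    sizes = tuple(sorted(Counter(labels).values(), reverse=True))
    names = {
        (4,): "4 agree",
        (3, 1): "3 agree",
        (2, 2): "2-2 split",
        (2, 1, 1): "2-1-1 split",
        (1, 1, 1, 1): "No agreement",
    }
    return names[sizes]


def hybrid_counts(ratings):
    """Count how often each pair of categories occurs as a 2-2 split."""
    counts = Counter()
    for labels in ratings:
        if agreement_pattern(labels) == "2-2 split":
            counts[tuple(sorted(set(labels)))] += 1
    return counts


# Toy ratings: four raters per document (invented for illustration).
toy_ratings = [
    ["R1", "R1", "R2", "R2"],
    ["R2", "R1", "R2", "R1"],
    ["R3", "R3", "R3", "R4"],
    ["R5", "R6", "R6", "R5"],
]
print(len(possible_hybrids(CATEGORIES)))
print(hybrid_counts(toy_ratings))
```

Filtering the resulting counter for pairs with a count above 100 would give the set of frequent hybrids.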
Frequencies of 2-way hybrids that occur 100+ times
Multi-Dimensional analysis
• Factor analysis to identify dimensions based on co-occurrence among a large set of linguistic features
• Interpret dimensions functionally
• Calculate scores for each text on each dimension
Features adopted from Biber:
Positive features:
Verbs: present tense verbs, mental verbs, do as pro-verb, be as main verb, possibility modals
Pronouns: 1st person pronouns, 2nd person pronouns, it, demonstrative pronouns, indefinite pronouns
Adverbs: general emphatics, hedges, amplifiers
Dependent clauses: that complement clauses (with that deletion), causative adverbial clauses, WH clauses
Other: contractions, analytic negation, discourse particles, sentence relatives, WH questions, clause coordination
Negative features: nouns, long words, prepositional phrases, attributive adjectives, lexical diversity
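
In multi-dimensional analysis, a text's score on a dimension is conventionally obtained by standardizing each feature's rate across the corpus, then adding the z-scores of the positive features and subtracting the z-scores of the negative features. The following is a minimal Python sketch of that scoring step only; it does not include the factor analysis. The feature keys, the toy rates, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of feature rates to mean 0 and SD 1."""
    mu = mean(values)
    sd = pstdev(values)  # population SD; an illustrative choice
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def dimension_scores(texts, positive, negative):
    """Score each text: sum of positive-feature z-scores minus
    sum of negative-feature z-scores.

    texts: list of dicts mapping feature name -> normalized rate.
    """
    z = {f: z_scores([t[f] for t in texts]) for f in positive + negative}
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(len(texts))
    ]


# Toy rates per 1,000 words (invented; feature keys are hypothetical).
toy_texts = [
    {"contractions": 30, "first_person_pronouns": 60, "nouns": 150},
    {"contractions": 2, "first_person_pronouns": 10, "nouns": 300},
]
scores = dimension_scores(
    toy_texts,
    positive=["contractions", "first_person_pronouns"],
    negative=["nouns"],
)
print(scores)
```

Texts with many positive features and few negative ones get high scores; texts with the opposite profile get low scores.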
The results
• Linguistic (use-based) variation across user-based register categories
Web registers along Dimension 1
What have we learned?
• Non-expert users can reliably classify web documents
• At least 1 in 10 internet texts belongs to a hybrid register category
• Publication type ≠ register (at least for the web)
  • E.g., blogs showed up in several register categories
• Triangulating end-user classifications with linguistic analysis gives us a more complete understanding of register variation on the web
Web register research: Next steps
• Comprehensive linguistic description of the patterns of register variation on the web
• A new multi-dimensional analysis of web registers
• Detailed linguistic descriptions of ‘unique’ web registers
• Automatic prediction of register (‘AGI’)
• Automatically coded large corpus of web documents
• Extend descriptions to include ‘private’ web registers
Areas for future user-based research
• Register classification of printed texts
• Reader/listener perceptions
• Corpus annotation
• Word sense disambiguation
5. The future of crowdsourcing in user-based linguistics
• User-based analyses have always happened; now we can do them in a more valid way using crowdsourcing
• Triangulating use-based linguistic data offers a more complete understanding of discourse
• Linguists are often unable to fully analyze and interpret patterns in use-based datasets, particularly those that are very large
• Harnessing the power of user-based data via crowdsourcing could help us tackle big, difficult problems in linguistics
Mechanical Turk
• The name comes from an 18th century machine that played chess.
• A person actually hid inside and played
Mechanical Turk
• Amazon's Mechanical Turk is a crowdsourcing tool.
• Researchers who need human evaluation can get data
• People who want to make some money help with the project (less than minimum wage)
  – Image recognition
  – Speech processing
  – Subjective evaluation
  – Giving opinions
  – Tagging corpora
  – Match picture with product
Mechanical Turk
• Example: word sense disambiguation in corpora
  – What should head be tagged as? Noun or verb?
  – What does head mean in a sentence?
    • They charged the head of finances with the crime. (person with office)
    • The beer was flat with no head. (froth)
    • They were going head first. (manner of movement)
• Computers can't do it well but people can
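
One common way to turn several workers' answers into a single tag is a majority vote. The following is a minimal Python sketch, assuming each sentence was judged by several workers. The function name, the threshold, and the worker responses are invented for illustration; only the sense labels come from the examples above.

```python
from collections import Counter


def majority_label(labels, min_share=0.5):
    """Return the most frequent label if more than min_share of the
    workers chose it; otherwise return None (no clear winner)."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_share else None


# Hypothetical worker responses for the sense of "head" in two sentences.
responses = {
    "The beer was flat with no head.": ["froth", "froth", "person with office"],
    "They were going head first.": ["manner of movement", "froth"],
}
for sentence, labels in responses.items():
    print(sentence, "->", majority_label(labels))
```

Items that come back as None could be sent to additional workers or set aside for expert review.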
How does it work?
Couldn't people cheat?
• After reviewing results, the requester can reject a worker
• When rejected, they don't get paid
• Workers have approval rates
• Requesters can choose only workers with good rates
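
The approval-rate idea can be illustrated with a small Python sketch. This is not the Mechanical Turk API. The Worker class, its fields, the 95% threshold, and the sample workers are all hypothetical and only show the logic of screening by approval rate.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical record of a worker's history; not an MTurk data type.
    worker_id: str
    approved: int
    rejected: int

    @property
    def approval_rate(self):
        """Share of reviewed tasks that were approved (0.0 if none yet)."""
        total = self.approved + self.rejected
        return self.approved / total if total else 0.0


def eligible_workers(workers, min_rate=0.95, min_tasks=0):
    """Keep workers whose approval rate and task count meet the thresholds."""
    return [
        w for w in workers
        if w.approval_rate >= min_rate
        and (w.approved + w.rejected) >= min_tasks
    ]


# Invented sample workers.
pool = [
    Worker("w1", approved=98, rejected=2),
    Worker("w2", approved=80, rejected=20),
    Worker("w3", approved=0, rejected=0),
]
print([w.worker_id for w in eligible_workers(pool)])
```

Raising min_tasks excludes workers with too little history for their rate to be informative.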
Advantages
• Thousands of potential workers available
• You can get results fast
• Demographic variety (not just undergrads)
• Cheap (average $1.40 per hour)
Disadvantages
• Cheating
  • Some studies show it's at same rates as in lab
  • Ways to test
    • “While exercising how often have you had a fatal heart attack?”
• It requires money
• Can't do many types of experiments (RT)