
Crowdsourcing
Ling 240

What is crowdsourcing?

Crowdsourcing: definition
“the practice of obtaining information or services by soliciting input from a large number of [non-expert] people, typically via the internet” (OED)

Examples:
• Wikipedia
• Google Translate
• FamilySearch Indexing

COCA's registers based on publication type

Crowdsourcing
• What are the benefits of collecting data through crowdsourcing?
• What are the limitations/weaknesses?
• What can be done to ensure that crowdsourcing workers are giving quality data?

Crowdsourcing in linguistics
• Wilhelm Kaeding (1897)
  – Thousands of non-experts helped compile and analyze an 11-million-word corpus of German
• Oxford English Dictionary (1858–1928)
  – Hundreds of non-expert readers submitted 6 million quotation slips
• Perceptual dialectology
  – Dialect perceptions elicited from non-experts

Mechanical Turk (Amazon)
• Strengths
  – Inexpensive
  – Fast
  – Quality control
  – Access to thousands of people
• Growing body of research strongly supports the quality of MTurk data
  – E.g., Buhrmester et al., 2011; Kittur et al., 2008; Suri & Watts, 2011; Urbano et al., 2010

Case study: Register classification
• Traditional ‘user’-based approach
  – ‘Expert’ classifies texts into registers by simply sampling from the publication type of interest
• Limitations
  – ‘Publication type’ is not a meaningful criterion for web documents
  – Experts can’t agree on register category for internet texts

Corpus
• Extracted from the Corpus of Global Web-based English (GloWbE), constructed by Mark Davies
• (Near) random sampling methods used to build the corpus
  – Google searches of highly frequent English 3-grams (e.g., is not the, and from the) used to identify URLs
  – 800-1000 links for each n-gram (i.e., 80-100 Google results pages)
• Davies randomly extracted c. 49,300 URLs from GloWbE
  – Only web pages from USA, UK, Canada, Aus., and NZ
  – Documents < 75 words were excluded
  – Non-textual material was removed from all web pages (HTML scrubbing and boilerplate removal) using JusText
• 1,445 URLs were excluded from subsequent analysis because they consisted mostly of photos or graphics.
• Final corpus for the study: 48,555 web documents.

Raters were asked to determine the mode of each passage, then its participants, purpose, etc. This led to 7 sub-registers

Crowdsourcing end-user data: Classification
• Developed a computer-adaptive survey for register classification

• Tested the tool through 10 rounds of piloting, resulting in numerous revisions

• Recruited 908 raters through Mechanical Turk

• 6 responses x 4 raters x 49,300 texts = 1.2 million individual ratings

Agreement results for the general register classification of 48,147 web documents (Fleiss’ Kappa = .47, moderate agreement)
• 69% of documents achieved majority agreement
• Additional 11.8% are potential 2-way hybrids

Agreement pattern   Documents   Percent
4 agree             17,511      36.4%
3 agree             15,684      32.6%
2-2 split            5,682      11.8%
2-1-1 split          8,515      17.7%
No agreement           755       1.6%
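The agreement figures above rest on two computations: sorting each document by how its four raters split, and summarizing overall agreement with Fleiss' kappa. A minimal sketch of both follows, using the standard Fleiss' kappa formula. The register labels and the four toy documents are invented for illustration and are not the study's data.

```python
from collections import Counter

def agreement_pattern(labels):
    """Vote counts for one document, largest first.
    (4,) = 4 agree, (3, 1) = 3 agree, (2, 2) = 2-2 split,
    (2, 1, 1) = 2-1-1 split, (1, 1, 1, 1) = no agreement."""
    return tuple(sorted(Counter(labels).values(), reverse=True))

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of documents, each rated by the same
    number of raters. Undefined (division by zero) if every rating in
    the data set uses one single category."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])
    # Observed agreement: mean proportion of agreeing rater pairs per document.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Expected agreement from the overall share of each category.
    total = n_items * n_raters
    p_cats = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_expected = sum(p * p for p in p_cats)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical toy ratings: 4 raters per document.
toy = [
    ["narrative", "narrative", "narrative", "narrative"],  # 4 agree
    ["opinion", "opinion", "opinion", "narrative"],        # 3 agree
    ["opinion", "opinion", "narrative", "narrative"],      # 2-2 split
    ["opinion", "narrative", "how-to", "how-to"],          # 2-1-1 split
]
patterns = Counter(agreement_pattern(doc) for doc in toy)
kappa = fleiss_kappa(toy)
print(patterns)
print(round(kappa, 3))
```

Documents with pattern (4,) or (3, 1) correspond to the "majority agreement" rows of the table.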

Frequencies of general register categories (i.e., documents where 3 or 4 raters were in agreement)

Systematic patterns of disagreement

• 28 different 2-2 combinations are possible in theory

• But, only 7 of those combinations occurred > 100 times in our corpus of 48,000 documents

• Because these are widely attested user-based patterns, we are able to interpret disagreement as a special pattern of agreement
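Counting which 2-2 combinations actually occur can be sketched as below. The figure of 28 theoretical combinations matches the number of unordered pairs drawn from eight categories; the eight is inferred from that figure rather than stated here. The category labels, the toy ratings, and the helper names are hypothetical.

```python
import math
from collections import Counter

def two_two_pair(labels):
    """Return the sorted pair of categories if four raters split 2-2, else None."""
    tally = Counter(labels)
    if sorted(tally.values()) == [2, 2]:
        return tuple(sorted(tally))
    return None

def frequent_hybrids(ratings, min_count):
    """Count each 2-2 pair and keep those seen at least min_count times."""
    pairs = Counter()
    for labels in ratings:
        pair = two_two_pair(labels)
        if pair is not None:
            pairs[pair] += 1
    return {pair: n for pair, n in pairs.items() if n >= min_count}

# Unordered pairs of 8 categories (assumed number of categories).
possible_pairs = math.comb(8, 2)

# Hypothetical toy data; a real run would use a threshold near 100.
toy = [
    ["narrative", "opinion", "narrative", "opinion"],
    ["opinion", "narrative", "opinion", "narrative"],
    ["how-to", "opinion", "how-to", "opinion"],
    ["narrative", "narrative", "narrative", "opinion"],
]
hybrids = frequent_hybrids(toy, min_count=2)
print(possible_pairs)
print(hybrids)
```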

Frequencies of 2-way hybrids that occur 100+ times

Multi-Dimensional analysis
• Factor analysis to identify dimensions based on co-occurrence among a large set of linguistic features
• Interpret dimensions functionally
• Calculate scores for each text on each dimension


Features adopted from Biber:

Positive features:

Verbs: present tense verbs, mental verbs, do as pro-verb, be as main verb, possibility modals

Pronouns: 1st person pronouns, 2nd person pronouns, it, demonstrative pronouns, indefinite pronouns

Adverbs: general emphatics, hedges, amplifiers

Dependent clauses: that complement clauses (with that deletion), causative adverbial clauses, WH clauses

Other: contractions, analytic negation, discourse particles, sentence relatives, WH questions, clause coordination

Negative features: Nouns, long words, prepositional phrases, attributive adjectives, lexical diversity
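One common way to turn positive and negative feature lists into a dimension score is to standardize each feature's rate across the corpus and then subtract the summed z-scores of the negative features from the summed z-scores of the positive ones. The sketch below assumes that approach. The feature subset, the rates per 1,000 words, and the text labels are invented for illustration.

```python
from statistics import mean, pstdev

# Small hypothetical subset of the feature lists above.
POSITIVE = ["contractions", "second_person_pronouns"]
NEGATIVE = ["nouns", "prepositional_phrases"]

def dimension_scores(texts, positive, negative):
    """Score each text: sum of z-scores on positive features minus
    sum of z-scores on negative features. Assumes every feature
    varies across texts (a zero standard deviation would divide by zero)."""
    features = positive + negative
    stats = {}
    for f in features:
        values = [t[f] for t in texts]
        stats[f] = (mean(values), pstdev(values))
    scores = []
    for t in texts:
        z = {f: (t[f] - stats[f][0]) / stats[f][1] for f in features}
        scores.append(sum(z[f] for f in positive) - sum(z[f] for f in negative))
    return scores

# Invented rates per 1,000 words for three texts.
texts = [
    {"contractions": 30, "second_person_pronouns": 20, "nouns": 150, "prepositional_phrases": 80},   # chatty
    {"contractions": 10, "second_person_pronouns": 8, "nouns": 220, "prepositional_phrases": 110},  # mixed
    {"contractions": 1, "second_person_pronouns": 0, "nouns": 310, "prepositional_phrases": 150},   # dense
]
scores = dimension_scores(texts, POSITIVE, NEGATIVE)
for label, score in zip(["chatty", "mixed", "dense"], scores):
    print(label, round(score, 2))
```

Texts with many positive features and few negative ones land at the high end of the dimension, and the reverse at the low end.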

The results
• Linguistic (use-based) variation across user-based register categories

Web registers along Dimension 1

What have we learned?
• Non-expert users can reliably classify web documents
• At least 1 in 10 internet texts belongs to a hybrid register category
• Publication type ≠ register (at least for the web)
  – E.g., blogs showed up in several register categories
• Triangulating end-user classifications with linguistic analysis gives us a more complete understanding of register variation on the web

Web register research: Next steps

• Comprehensive linguistic description of the patterns of register variation on the web

• A new multi-dimensional analysis of web registers
• Detailed linguistic descriptions of ‘unique’ web registers
• Automatic prediction of register (‘AGI’)
• Automatically coded large corpus of web documents
• Extend descriptions to include ‘private’ web registers

Areas for future user-based research
• Register classification of printed texts
• Reader/listener perceptions
• Corpus annotation
• Word sense disambiguation

5. The future of crowdsourcing in user-based linguistics
• User-based analyses have always happened; now we can do them in a more valid way using crowdsourcing

• Triangulating use-based linguistic data offers a more complete understanding of discourse

• Linguists are often unable to fully analyze and interpret patterns in use-based datasets, particularly those that are very large

• Harnessing the power of user-based data via crowdsourcing could help us tackle big, difficult problems in linguistics

Mechanical Turk
• The name comes from an 18th-century machine that played chess.
• A person actually hid inside and played

Mechanical Turk
• Amazon's Mechanical Turk is a crowdsourcing tool.
• Researchers who need human evaluation can get data
• People who want to make some money help with the project (less than minimum wage)
  – Image recognition
  – Speech processing
  – Subjective evaluation
  – Giving opinions
  – Tagging corpora
  – Match picture with product

Mechanical Turk
• Example: word sense disambiguation in corpora
  – What should head be tagged as? Noun or verb?
  – What does head mean in a sentence?
• They charged the head of finances with the crime. (person with office)
• The beer was flat with no head. (froth)
• They were going head first. (manner of movement)
• Computers can't do it well but people can
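Once several workers have tagged the same sentence, their answers still need to be combined into one label. A minimal sketch of majority voting follows, assuming each sentence is tagged by an odd number of workers. The worker answers are invented, and a tie is left unresolved rather than guessed.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label, or None when there is no
    label or the top two labels are tied."""
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

# Hypothetical worker answers for the sense of "head" in each sentence.
judgments = {
    "They charged the head of finances with the crime.": ["person", "person", "person"],
    "The beer was flat with no head.": ["froth", "froth", "person"],
    "They were going head first.": ["manner", "body part", "manner"],
}
senses = {sentence: majority_label(votes) for sentence, votes in judgments.items()}
for sentence, sense in senses.items():
    print(sense, "|", sentence)
```

Sentences that come back as None could be sent out for additional ratings or reviewed by hand.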

How does it work?

Couldn't people cheat?
• After reviewing results the requester can reject a worker
• When rejected, they don't get paid
• Workers have approval rates
• Requesters can choose only workers with good rates

Advantages
• Thousands of potential workers available
• You can get results fast
• Demographic variety (not just undergrads)
• Cheap (average $1.40 per hour)

Disadvantages
• Cheating
  – Some studies show it's at same rates as in lab
  – Ways to test: “While exercising how often have you had a fatal heart attack?”
• It requires money
• Can't do many types of experiments (e.g., reaction time, RT)
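A catch question such as the heart-attack item can be used to screen responses before analysis. The sketch below shows one way to do that. The field names, the expected answer "never", and the worker IDs are hypothetical and do not reflect Mechanical Turk's own data format.

```python
def passes_attention_check(response, check_field="fatal_heart_attack", expected="never"):
    """True if the worker gave the only sensible answer to the catch question."""
    return response.get(check_field) == expected

def screen_responses(responses):
    """Split responses into those kept and those flagged for review."""
    kept, flagged = [], []
    for response in responses:
        if passes_attention_check(response):
            kept.append(response)
        else:
            flagged.append(response)
    return kept, flagged

# Invented survey responses.
responses = [
    {"worker": "W1", "fatal_heart_attack": "never", "rating": "narrative"},
    {"worker": "W2", "fatal_heart_attack": "often", "rating": "opinion"},
    {"worker": "W3", "fatal_heart_attack": "never", "rating": "opinion"},
]
kept, flagged = screen_responses(responses)
print([r["worker"] for r in kept])
print([r["worker"] for r in flagged])
```

Flagged responses are the ones a requester might review before deciding whether to reject the work.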

Go look at it

Mechanical Turk website

https://www.mturk.com/mturk/welcome