TRANSCRIPT
Crowdsourcing
Sasha Mile Rudan
Crowdsourcing • Crowdsourcing is a process where work is
outsourced to an undefined group of people • Term coined by Jeff Howe and Mark Robinson, Wired Magazine
• Has often been used in the past as a competition to discover a solution
• Micro-tasks, repetition, learning, fatigue • Collaborative human interpreter
• first proposed as a programming language on the blog Google Blogoscoped (Philipp Lenssen)
Mechanical Turk • Automaton Chess Player • a fake chess-playing
machine constructed in the late 18th century
• Human “supported”/ “augmented” machine
Amazon Mechanical Turk (1/2) • Collaborative human interpreter
• first proposed as a programming language on the blog Google Blogoscoped (Philipp Lenssen)
• HIT: a Human Intelligence Task • Assignment
• Every HIT can be replicated into multiple assignments • enables majority voting for quality assurance
• HIT Group: • AMT automatically groups similar HITs together
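The majority-voting idea above can be sketched in a few lines of Python. This is only an illustration, not AMT's own mechanism: the HIT identifiers, the labels, and the `majority_vote` helper are all hypothetical, and a tie is simply reported as undecided.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, or None if empty or tied for first place."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no majority
    return counts[0][0]

# Hypothetical HITs, each replicated into several assignments (one answer each).
assignments = {
    "hit-1": ["fake", "fake", "real"],
    "hit-2": ["real", "fake"],
}
results = {hit: majority_vote(answers) for hit, answers in assignments.items()}
print(results)
```

Using an odd number of assignments per HIT avoids ties for two-way questions.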
Amazon Mechanical Turk (2/2) • AMT workflow: requester, workers (turks) • 2 kinds of user interface for workers • Relationship & reputation • Workers have online forums about requesters • AMT has REST APIs • Applications: computationally difficult tasks, missing persons
searches, social science experiments, artistic & educational research
Social Turing Tests: Crowdsourcing Sybil Detection
• Feasibility of a crowdsourced Sybil detection system for online social networks (OSNs)
• Sybil, psychiatric case of Shirley Mason who reportedly had 16 personalities
• Detection accuracy (experts, turkers) • Scalability, efficiency
Automated techniques of detection • SybilRank, … • difficulty friending legitimate users
• Own communities • community detection techniques, social graph
• New generations of malicious users • infiltrate communities of legitimate users • copying profile data, recruiting real users to
customize them, crowdsourced, paid, anti-CAPTCHA
Ground-truth data collection • Renren
• Sybils directly from Renren Inc • Zhubajie, crowdsourcing platform, instead of AMT
• Facebook USA, Facebook India • Sybil accounts by crawling
• Legitimate profiles: seeding from 4 trusted friends => friends-of-friends => unknown
Social-Turing tests • Social-Turing tests are resilient to
changing attacker strategies • Crowdsourcing is much cheaper than hiring
full-time content moderators • How accurate are users at distinguishing between real
and fake profiles? Are there demographic factors? Does survey fatigue impact detection accuracy? Is crowdsourced Sybil detection cost effective?
Datasets
User Study • User questionnaire • 2 languages • no malicious testers • Wall and photos are under • Chance to change the previous answer
Demographic
Testers’ accuracy • Individual accuracy
• Chinese and Indian turkers perform the worst, with half achieving ≤65% accuracy
False positive and False negative • False positives are uniformly lower than
false negatives
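To make the two error types concrete, here is a small sketch of how they could be computed. It assumes that "positive" means "flagged as Sybil", so a false positive is a legitimate profile flagged as fake and a false negative is a Sybil judged legitimate. The labels below are made-up illustration data, not results from the study.

```python
def error_rates(truth, predicted):
    """Return (false_positive_rate, false_negative_rate).

    truth[i] is True if profile i really is a Sybil;
    predicted[i] is True if the tester flagged it as a Sybil.
    """
    pairs = list(zip(truth, predicted))
    fp = sum(1 for t, p in pairs if not t and p)   # legitimate, flagged as Sybil
    fn = sum(1 for t, p in pairs if t and not p)   # Sybil, judged legitimate
    legit = sum(1 for t in truth if not t)
    sybil = sum(1 for t in truth if t)
    fpr = fp / legit if legit else 0.0
    fnr = fn / sybil if sybil else 0.0
    return fpr, fnr

# Made-up data: 4 Sybils followed by 4 legitimate profiles.
truth     = [True, True, True,  True,  False, False, False, False]
predicted = [True, True, False, False, False, False, False, True]
fpr, fnr = error_rates(truth, predicted)
print(fpr, fnr)
```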
Improvements, majority voting • FN error increases as more turkers are added
Reasons for Suspicion
Turker Accuracy Analysis • Demographic factors
• Previous slide with demography
• Temporal Factors and Survey Fatigue
Turker Selection • Can we continue to improve the overall
accuracy of turkers by simply adding more of them?
• Filtering Inaccurate Turkers • pre-screening test
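A pre-screening test could look roughly like the sketch below: each turker first labels a few profiles whose ground truth is already known, and only turkers above an accuracy threshold are kept. The 0.7 threshold, the profile and turker identifiers, and the function names are assumptions for illustration; the slides do not specify them.

```python
def accuracy(answers, ground_truth):
    """Fraction of a turker's answers that match the known labels."""
    if not answers:
        return 0.0
    correct = sum(1 for pid, label in answers.items() if ground_truth[pid] == label)
    return correct / len(answers)

def prescreen(turker_answers, ground_truth, threshold=0.7):
    """Return the sorted IDs of turkers whose screening accuracy meets the threshold."""
    return sorted(
        turker
        for turker, answers in turker_answers.items()
        if accuracy(answers, ground_truth) >= threshold
    )

# Hypothetical screening set with known labels.
ground_truth = {"p1": "fake", "p2": "real", "p3": "fake", "p4": "real"}
turker_answers = {
    "t1": {"p1": "fake", "p2": "real", "p3": "fake", "p4": "real"},  # 4 of 4 correct
    "t2": {"p1": "real", "p2": "real", "p3": "real", "p4": "real"},  # 2 of 4 correct
    "t3": {"p1": "fake", "p2": "real", "p3": "real", "p4": "real"},  # 3 of 4 correct
}
kept = prescreen(turker_answers, ground_truth)
print(kept)
```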
Profile Difficulty • Are there extremely difficult “stealth” Sybils? • Turker luck?
System design
• One-layer schema • Two-layer schema • Simulation • Privacy
CrowdDB: Answering Queries with Crowdsourcing
• human input for providing information that is missing from the database
• computationally difficult functions, matching, ranking, or aggregating results based on fuzzy criteria
• Closed-world assumption • Amazon Mechanical Turk
RDBMS • Limitations of the technology are becoming
more apparent • Assumptions about the correctness,
completeness and unambiguity of the data they store • When these assumptions fail to hold, relational
systems will return incorrect or incomplete answers
• SELECT market_capitalization FROM company WHERE name = "I.B.M.";
• Typing error • inadvertently deleted, entered incorrectly • Multiple ways to refer to the same real-world entity
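The failure mode above is easy to reproduce with Python's built-in `sqlite3` module. In this sketch the table layout follows the query on the slide, while the stored row, its placeholder value, and the `normalize` helper are assumptions for illustration. The naive normalization is only a stand-in to show that the record exists; it is not CrowdDB's approach, which relies on human input for such matching.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (name TEXT, market_capitalization REAL)")
# The entity is stored as "IBM"; 100.0 is a placeholder, not a real figure.
conn.execute("INSERT INTO company VALUES (?, ?)", ("IBM", 100.0))

# Closed-world assumption: no exact match means "no such company".
exact = conn.execute(
    "SELECT market_capitalization FROM company WHERE name = ?", ("I.B.M.",)
).fetchall()

def normalize(name):
    """Naive stand-in for entity resolution: drop punctuation, uppercase."""
    return "".join(ch for ch in name if ch.isalnum()).upper()

# Register the helper as a SQL function so it can be used inside the query.
conn.create_function("normalize", 1, normalize)
loose = conn.execute(
    "SELECT market_capitalization FROM company "
    "WHERE normalize(name) = normalize(?)",
    ("I.B.M.",),
).fetchall()

print(exact)  # empty result although the company is in the table
print(loose)  # the row is found once both spellings are normalized
```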
• SELECT image FROM picture WHERE topic = "Business Success" ORDER BY relevance LIMIT 1;
• could easily be answered by people
AMT • Relatively Small Worker Pool