
Harnessing the Intelligence of the Crowd for Problem Solving and Knowledge Discovery

Jon Chamberlain
School of Computer Science and Electronic Engineering

University of Essex
Wivenhoe Park, Colchester CO4 3SQ, England

[email protected]

Abstract

Crowdsourcing has established itself in the mainstream of research methodology in recent years, using a variety of methods to engage humans to solve problems that computers, as yet, cannot solve. Some approaches need to incentivise users to participate; others have self-motivated users but present different challenges. This research investigates the user motivations behind crowdsourcing and human computation by looking at microworking, games-with-a-purpose and groupsourcing approaches in terms of quality, quantity and other factors that determine their utility.

Keywords: games-with-a-purpose, social networks, crowd intelligence

Introduction

Crowdsourcing has established itself in the mainstream of research methodology in recent years, using a variety of methods to engage humans to solve problems that computers, as yet, cannot solve. Whilst the concept of human computation (von Ahn 2006) goes some way towards solving problems, it also introduces new challenges for researchers, not least how to deal with human psychology. Issues of participant recruitment and incentivisation are significant, and many projects do not live up to expectations because human effort cannot be acquired in the same way as machine effort.

The motivation for the PhD research comes from developing and analysing data from a crowdsourced text annotation game. Whilst the game was comparatively successful in terms of the quantity and quality of its data, it still suffered numerous bottlenecks in data collection. To address these issues, a second version of the game was deployed on Facebook to make use of the existing network of users, and this achieved better results.

Based on these findings, the PhD research focused on areas where users were already completing tasks using social networks without a central structure. By applying crowdsourcing methods to aggregate the data, more meaningful visualisations and methods of knowledge discovery could be achieved.


Related Work

Three common variations of collaboration over the Internet that have been successful can be distinguished by the motivations of the participants.

In the first approach, the motivation for the users to participate already exists. This could be because the user is inherently interested in contributing, as with Wikipedia (Nov 2007), or because users need to accomplish a different task, for example the reCAPTCHA authentication system (von Ahn et al. 2008).

Many human computation tasks are neither interesting nor easy to integrate into another system, so a second approach to crowdsourcing called microworking was developed, for example Amazon Mechanical Turk (Kittur, Chi, and Suh 2008). Participants are paid small amounts of money to complete HITs (Human Intelligence Tasks). Simple tasks can be completed very quickly (Snow et al. 2008); however, this approach cannot be scaled up for large data collection efforts due to the cost. Issues of ethics and workers’ rights have also been raised (Fort, Adda, and Cohen 2011).

A third approach is to entertain the user whilst they complete tasks, typically using games or gamification. The purposeful games or games-with-a-purpose (GWAP) approach has been used for many different types of crowdsourced data collection including text, image, video and audio annotation, biomedical applications, transcription, search and social bookmarking (Chamberlain et al. 2013).

These approaches can be seen as examples of the broad term collective intelligence (Malone, Laubacher, and Dellarocas 2009).

Projects that do not have the budget to recruit users on a large scale are reliant on accessing existing user groups. Social networks such as Facebook, LinkedIn and Flickr offer access to large user communities through integrated software applications and/or a back-end API. As social networks mature, the software is utilised in different ways, with decentralised and unevenly distributed organisation of content, similar to how Wikipedia users create pages of dictionary content. Citizen science, where members of the public contribute knowledge to scientific endeavours, is an established predecessor of crowdsourcing, and social networks have been successfully used to connect professional scientists with amateur enthusiasts (Sidlauskas et al. 2011). Social networks are self-organized and decentralized; tasks are created by the users, so they are intrinsically motivated to participate, and the natural language of the interface allows them to express their emotions and frustrations whilst solving tasks.

Groupsourcing, where a task is completed using a group of intrinsically motivated people of varying expertise connected through a social network, is an effective method of human computation in some domains (Chamberlain 2014). Users solve problems with high accuracy, educate each other and share novel information and ideas. Contribution to science, learning and discovery are the driving motivations behind citizen science participation (Raddick et al. 2013).

Research

This PhD research looks at entity classification in text documents (Wikipedia articles and fiction from Project Gutenberg [1]) and images from social networks in the marine biology domain.

In the case of text documents, entity classification is performed through anaphora resolution, the semantic task concerned with recognizing that, e.g., the pronoun it and the definite nominal the town refer to the same entity as the proper name Wivenhoe, and to a different entity from the mentions Colchester or River Colne.

Wivenhoe developed as a port and until the late 19th century was effectively a port for Colchester, as large ships were unable to navigate any further up the River Colne, and had two prosperous shipyards. It became an important port for trade for Colchester and developed shipbuilding and fishing industries. The period of greatest prosperity for the town came with the arrival of the railway in 1863. [2]

Anaphora resolution is a key semantic task both from a linguistic perspective and for applications ranging from summarisation to text mining.
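As a minimal illustration of the data structure behind this task (a hypothetical representation of the passage above, not the markup format used in the project), the coreference chains could be encoded and queried as follows:

```python
# Hypothetical representation of coreference chains for the passage above.
# Each chain groups mentions (markables) that refer to the same entity;
# "Colchester" and "the River Colne" belong to separate chains.
coreference_chains = {
    "Wivenhoe": ["Wivenhoe", "a port", "It", "the town"],
    "Colchester": ["Colchester"],
    "River Colne": ["the River Colne"],
}

def chain_of(mention: str) -> str:
    """Anaphora resolution viewed as chain lookup: which previously
    introduced entity does this mention refer to?"""
    for entity, mentions in coreference_chains.items():
        if mention in mentions:
            return entity
    return "unknown"

print(chain_of("It"))  # -> "Wivenhoe"
```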

Entity recognition is also performed in image classification (where objects in an image are identified). In this case the annotations are open (can be any text) and apply to the whole image. Region annotation, where parts of an image are annotated, is more complex.

The power of mobilising a crowd to examine images on a large scale was pioneered by the search for sailor and computer scientist Jim Gray in 2007 [3] and most recently seen with the disappearance of Malaysia Airlines flight MH370 in 2014 [4]. Millions of users analysed satellite imagery, tagging anything that looked like wreckage, life rafts and oil slicks, with interesting images being passed on to experts.

Some citizen science projects get members of the public to classify objects in images taken from ROVs (Remotely Operated Vehicles) [5][6][7], whilst others require the users to supply the source data as well as the classification [8][9][10]. The latter has been less active due to technical constraints (the users need to be trained in SCUBA diving and have underwater photographic equipment) but has empowered users to have their images identified by experts and to contribute to scientific endeavours. The quality of data generated by citizen scientists has been shown to be comparable to that generated by experts when producing taxonomic lists (Holt et al. 2013), even when the task is not trivial (He, van Ossenbruggen, and de Vries 2013).

[1] http://www.gutenberg.org
[2] Taken from Wikipedia's page about Wivenhoe, the village next to the University of Essex.
[3] http://www.wired.com/techbiz/people/magazine/15-08/ff_jimgray
[4] http://www.tomnod.com/nod/challenge/mh370_indian_ocean

In both types of task an occurrence of an entity is identified that can be used to build an entity concept. This research investigates several themes that are essential for harnessing the intelligence of the crowd to complete this task.

User participation

One of the most significant failings of human computation systems is the lack of participation from users. Incentives are commonly divided into personal, social and financial categories, but what is the most effective motivator for a particular task?

Task design

The interface design needs to be appropriate for the task, the intended users and the level of difficulty, but how should data quantity and quality be balanced? Is it better to collect more noisy data or less, higher-quality data? How will the task difficulty affect the users? Will they rise to the challenge of difficult tasks or be put off because the incentives are not sufficient?

Assessing users and data

Assessing users and the data they contribute is a key part of this research. Attention slips, malicious input and poorly trained users need to be differentiated from genuinely ambiguous data, and this can be done at the point of data entry or in post-processing. Without a known set of answers to judge users by, it may be necessary to rely on user reaction time (Chamberlain and O’Reilly 2014) or on modelling their behaviour over time (Passonneau and Carpenter 2013).
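As a rough sketch of one such post-processing check (the threshold and record layout are illustrative assumptions, not the method of the cited work), judgements made implausibly quickly could be flagged before aggregation:

```python
# Illustrative sketch: flag judgements whose response time suggests an
# attention slip or a click-through rather than a considered decision.
# The 1.5 second threshold and the record layout are assumptions.
MIN_RESPONSE_SECONDS = 1.5

judgements = [
    {"user": "u1", "task": "t1", "answer": "A", "response_time": 4.2},
    {"user": "u2", "task": "t1", "answer": "B", "response_time": 0.4},
]

def plausible(judgement: dict) -> bool:
    """Keep only judgements above the minimum plausible response time."""
    return judgement["response_time"] >= MIN_RESPONSE_SECONDS

kept = [j for j in judgements if plausible(j)]
print(len(kept))  # -> 1
```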

Aggregating data

Once the data has been collected, it needs to be aggregated to produce a resource with a set of answers to the tasks. This data can then be compared with the output of traditional methods of completing the tasks to assess the utility of the crowdsourcing approaches in terms of time, complexity and financial investment.

[5] http://www.planktonportal.org
[6] http://www.seafloorexplorer.org
[7] http://www.subseaobservers.com
[8] http://www.projectnoah.org
[9] http://www.arkive.org
[10] http://www.brc.ac.uk/irecord


Figure 1: Screenshot of Phrase Detectives on Facebook.

Methodology

The PhD research investigates three approaches for problem solving and knowledge discovery:

Text annotation with a game

Phrase Detectives [11] is a single-player game-with-a-purpose developed to collect data about anaphora and is centred around a detective metaphor (Chamberlain, Poesio, and Kruschwitz 2008). The game architecture is articulated around a number of tasks and uses scoring, progression and a variety of other mechanisms to make the activity enjoyable. A mixture of incentives is employed, from the personal (scoring, levels) to the social (competing for some players, participating in a worthwhile enterprise for others) to the financial (small prizes). This approach was adopted not just to annotate large amounts of text, but also to collect a large number of judgements about each linguistic expression.

The Facebook version of Phrase Detectives [12], launched in February 2011, maintained the overall game architecture whilst incorporating a number of new features developed specifically for the social network platform. The game was developed using the Facebook PHP SDK (an API library allowing access to user data, friend lists, wall posting, etc.) and integrates seamlessly within the Facebook site.

The game uses two styles of text annotation for players to complete a linguistic task. Initially, text is presented in Annotation Mode (called Name the Culprit in the game). This is a straightforward annotation mode where the player makes an annotation decision about a highlighted markable (section of text). If different players enter different interpretations for a markable, then each interpretation is presented to more players in Validation Mode (called Detectives Conference in the game). The players in Validation Mode have to agree or disagree with the interpretation.
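A simplified sketch of this two-stage flow (the data structures and the rule for triggering validation are assumptions made for illustration; the game itself is implemented in PHP):

```python
from collections import Counter

# Illustrative two-stage workflow: Annotation Mode collects interpretations
# for a markable; if they conflict, each distinct interpretation becomes a
# Validation Mode task where further players agree or disagree.

def needs_validation(annotations: list[str]) -> bool:
    """Validation is triggered when players give different interpretations."""
    return len(set(annotations)) > 1

def validation_tasks(annotations: list[str]) -> list[str]:
    """Each distinct interpretation is presented separately for validation."""
    return sorted(set(annotations))

annotations = ["Wivenhoe", "Wivenhoe", "the town"]  # from Annotation Mode
if needs_validation(annotations):
    for interpretation in validation_tasks(annotations):
        print(f"Validation Mode: agree or disagree with '{interpretation}'?")
else:
    print(f"Accepted without validation: {Counter(annotations).most_common(1)[0][0]}")
```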

[11] https://www.phrasedetectives.com
[12] https://apps.facebook.com/phrasedetectives

Figure 2: Detail of a typical message from Facebook containing an image classification task, having been analysed for named entities.

Crowdsourcing method                    Accuracy
Groupsourcing (test set)                0.93
Crowdflower (training) @ $0.05, n=10    0.91
Crowdflower (test set) @ $0.05, n=10    0.49

Table 1: Comparison of image classification accuracy between groupsourcing and microworking.

Image classification with groupsourcing

Facebook has a vast resource of uploaded images from its community of users, with over 250 billion images and a further 350 million posted every day. Images of things (rather than people or places) that have been given captions by users only represent 1% of this data, but that is still of the order of 2.6 billion images [13].

The accuracy of the image tags has been shown to be very high in some domains (Chamberlain 2014); however, automatically aggregating the data is not trivial. Ontologies, gazetteers or controlled vocabularies can be used to structure the content. This research uses, in the first instance, an ontology of marine species [14] as a hierarchical list of named entities. Each chunk of text from a message thread is scanned for named entities from the ontology and an index table is created in a MySQL database (see Figure 2).

The named entity index is used to aggregate the messages, allowing a user to find all content regarding a particular marine species and other species that are associated with it (what it eats, what it looks similar to, etc.). Additionally, messages containing a named entity with an image attached are used to create a gallery of photographic examples of the species.
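A minimal sketch of this indexing and aggregation step (the species list, message format and exact-match rule are simplified assumptions, and the index is held in memory here rather than in the MySQL table the project uses):

```python
# Simplified sketch: scan message text for named entities drawn from a
# marine species ontology and build an index mapping each entity to the
# messages (and attached images) that mention it.
species_ontology = ["Clavelina lepadiformis", "Botryllus schlosseri"]  # illustrative

messages = [
    {"id": 1, "text": "Is this Clavelina lepadiformis?", "image": "img1.jpg"},
    {"id": 2, "text": "Looks like Botryllus schlosseri to me.", "image": None},
]

entity_index: dict[str, list[int]] = {}
for msg in messages:
    for species in species_ontology:
        if species.lower() in msg["text"].lower():
            entity_index.setdefault(species, []).append(msg["id"])

# Aggregation: a photo gallery per species, built from indexed messages
# that have an image attached.
gallery = {
    species: [m["image"] for m in messages
              if m["id"] in ids and m["image"] is not None]
    for species, ids in entity_index.items()
}
print(gallery)  # {'Clavelina lepadiformis': ['img1.jpg'], 'Botryllus schlosseri': []}
```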

Microworking on Crowdflower

There has been considerable research into user behaviour on Amazon's Mechanical Turk and Crowdflower, and these platforms have become default research tools for collecting small quantities of research data. Text and image data from the above approaches have been annotated using Crowdflower in order to benchmark the task difficulty using the same interface for different media (see Table 1 for a comparison of Crowdflower with groupsourcing).

[13] http://www.insidefacebook.com/2013/11/28/infographic-what-types-of-images-are-posted-on-facebook

[14] http://www.marinespecies.org (Sept 2012)


Figure 3: Chart showing the ranked scores of players from Phrase Detectives on Facebook.

Challenges

There are numerous challenges in this area of research.

Application of incentives

It is increasingly difficult to attract the interest of users; however, the principle of using personal, social or financial incentives still seems to apply. New or adapted strategies are some of the most discussed topics amongst the developers of human computation systems.

Workload distribution

Studies of user contribution in Phrase Detectives show that the ten highest-scoring players (representing 1.3% of total players) had 60% of the total points on the system and had made 73% of the annotations (Chamberlain, Poesio, and Kruschwitz 2009). In the Facebook version of the game, the ten highest-scoring players (representing 1.6% of total players) had 89% of the total points and had made 89% of the annotations (see Figure 3). A similar Zipfian distribution of workload is seen with the users of groupsourcing and other approaches.

These results show that the majority of the workload is being done by a handful of users. However, the influence of users who only contribute a little should not be undervalued, as in some systems it can account for as much as 30% of the workload (Kanefsky, Barlow, and Gulick 2001), and this is what makes the collective decision making robust.
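The concentration figures above can be computed directly from per-player contribution totals; a brief sketch under the assumption that annotation counts are available as a simple list (the numbers below are illustrative, not project data):

```python
# Sketch: measure how much of the workload the highest-scoring players carry.
def top_k_share(counts: list[int], k: int) -> float:
    """Fraction of all contributions made by the k most active players."""
    ranked = sorted(counts, reverse=True)
    return sum(ranked[:k]) / sum(ranked)

annotation_counts = [5000, 3200, 900, 40, 12, 7, 3, 1, 1, 1]  # illustrative
print(f"Top 3 players made {top_k_share(annotation_counts, 3):.0%} of the annotations")
```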

Crowd homogeneity

The gender distribution of the active users of groupsourcing shows a distinct male bias in contrast to other types of social network gaming (Chamberlain, Kruschwitz, and Poesio 2012), and Facebook generally, which is reported to have more female users [15]. Only 12% of contributors to Wikipedia are female (Glott, Schmidt, and Ghosh 2010), a statistic that prompted significant research into the gender bias in the authorship of the site (Laniado et al. 2012).

[15] http://royal.pingdom.com/2009/11/27/study-males-vs-females-in-social-networks

Figure 4: Actual score per task from Phrase Detectives compared to a simulation model based on task difficulty, corpora completion and average ability of the crowd.

It may be that groupsourcing is appealing in the same way as Wikipedia, or perhaps males prefer image-based tasks to word-based problems (Mason and Watts 2009), or even that the topic investigated is a male-dominated interest. Differences in homogeneity can have an impact on collective intelligence (Woolley et al. 2010).

Crowd-powered experts

A classification task using images of breast cancer showed reasonable accuracy from Crowdflower using a similar configuration; however, an additional approach was to “crowd-power experts” by using crowdsourcing to deal with the majority of the easy work and getting experts to focus on the difficult images (Eickhoff 2014). This accuracy is comparable to what could be achieved by groupsourcing, which could be considered a similar scenario in which the majority of group users take on the bulk of the work solving easy tasks, leaving the experts to focus on what is of most interest to them. However, the distinction between experts and non-experts in the crowd may not be clear-cut (Brabham 2012).
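The routing idea could be sketched as follows (the agreement threshold and label set are illustrative assumptions, not the configuration used in the cited study):

```python
from collections import Counter

# Sketch of "crowd-powered experts": tasks the crowd agrees on strongly are
# accepted from the crowd; low-agreement (difficult) tasks are escalated to
# experts. The 0.8 agreement threshold is an illustrative assumption.
AGREEMENT_THRESHOLD = 0.8

def route(crowd_labels: list[str]) -> tuple[str, str]:
    label, votes = Counter(crowd_labels).most_common(1)[0]
    agreement = votes / len(crowd_labels)
    if agreement >= AGREEMENT_THRESHOLD:
        return ("crowd", label)
    return ("expert", "needs review")

print(route(["benign", "benign", "benign", "benign", "malignant"]))  # ('crowd', 'benign')
print(route(["benign", "malignant", "benign", "malignant"]))         # ('expert', 'needs review')
```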

Automatic processing

The pre-processing of the text used for Phrase Detectives was reasonably accurate, but there were errors that had to be manually corrected, which, given the size of the corpus, took considerable time. A significant challenge for groupsourcing as a methodology is the automatic processing of the threads. There is a large quantity of data associated with threads, and removing this overhead is essential when processing on a large scale. The natural language processing needs to cope with ill-formed grammar and spelling, and with sentences where only previous context could make sense of the meaning, for example:

“And my current puzzle ...”
“Need assistance with this tunicate please.”
“couldn’t find an ID based on these colours”
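One simple way to tolerate spelling variation in messages such as those above is approximate matching against the ontology terms; a sketch using only the Python standard library (the cutoff, term list and tokenisation are illustrative assumptions, not the project's actual pipeline):

```python
import difflib
import re

# Sketch: tolerate misspellings in user messages by fuzzy-matching each word
# against a small set of ontology terms.
ontology_terms = ["tunicate", "anemone", "nudibranch"]  # illustrative

def fuzzy_entities(text: str, cutoff: float = 0.8) -> list[str]:
    """Return ontology terms that approximately match words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    found = []
    for word in words:
        found.extend(difflib.get_close_matches(word, ontology_terms, n=1, cutoff=cutoff))
    return found

print(fuzzy_entities("Need assistance with this tunacate please."))  # ['tunicate']
```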

Aggregating data

Currently, all the methodologies used in this research use majority-voting aggregation to produce the best answer (see Figure 4); however, sophisticated crowd aggregation techniques (Raykar et al. 2010) could be used to gauge the confidence of data extracted from threads on a large scale.
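Majority voting itself is straightforward; a minimal sketch with a crude vote-share confidence (the record layout is assumed for illustration):

```python
from collections import Counter, defaultdict

# Sketch of majority-voting aggregation: for each task, the interpretation
# chosen by most contributors becomes the answer, and its share of the votes
# gives a simple confidence value.
judgements = [
    ("task1", "Wivenhoe"), ("task1", "Wivenhoe"), ("task1", "the town"),
    ("task2", "Colchester"), ("task2", "Colchester"),
]

votes = defaultdict(Counter)
for task, answer in judgements:
    votes[task][answer] += 1

for task, counter in sorted(votes.items()):
    answer, count = counter.most_common(1)[0]
    confidence = count / sum(counter.values())
    print(f"{task}: {answer} (confidence {confidence:.2f})")
```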

References

Brabham, D. C. 2012. The myth of amateur crowds. Information, Communication and Society 15(3):394–410.

Chamberlain, J., and O’Reilly, C. 2014. User performance indicators in task-based data collection systems. In Proceedings of MindTheGap’14.

Chamberlain, J.; Fort, K.; Kruschwitz, U.; Mathieu, L.; and Poesio, M. 2013. Using games to create language resources: Successes and limitations of the approach. In ACM Transactions on Interactive Intelligent Systems, volume The People’s Web Meets NLP: Collaboratively Constructed Language Resources. Springer.

Chamberlain, J.; Kruschwitz, U.; and Poesio, M. 2012. Motivations for participation in socially networked collective intelligence systems. In Proceedings of CI’12.

Chamberlain, J.; Poesio, M.; and Kruschwitz, U. 2008. Phrase Detectives: A web-based collaborative annotation game. In Proceedings of the International Conference on Semantic Systems (I-Semantics’08).

Chamberlain, J.; Poesio, M.; and Kruschwitz, U. 2009. A new life for a dead parrot: Incentive structures in the Phrase Detectives game. In Proceedings of the WWW 2009 Workshop on Web Incentives (WEBCENTIVES’09).

Chamberlain, J. 2014. Groupsourcing: Distributed problem solving using social networks. In Proceedings of HCOMP’14.

Eickhoff, C. 2014. Crowd-powered experts: Helping surgeons interpret breast cancer images. In Proceedings of GamifIR’14.

Fort, K.; Adda, G.; and Cohen, K. B. 2011. Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics (editorial) 37:413–420.

Glott, R.; Schmidt, P.; and Ghosh, R. 2010. Wikipedia survey – Overview of results. UNU-MERIT 1–11.

He, J.; van Ossenbruggen, J.; and de Vries, A. P. 2013. Do you need experts in the crowd?: A case study in image annotation for marine biology. In Proceedings of OAIR’13, 57–60.

Holt, B. G.; Rioja-Nieto, R.; MacNeil, A. M.; Lupton, J.; and Rahbek, C. 2013. Comparing diversity data collected using a protocol designed for volunteers with results from a professional alternative. Methods in Ecology and Evolution 4(4):383–392.

Kanefsky, B.; Barlow, N.; and Gulick, V. 2001. Can distributed volunteers accomplish massive data analysis tasks? Lunar and Planetary Science XXXII.

Kittur, A.; Chi, E. H.; and Suh, B. 2008. Crowdsourcing user studies with Mechanical Turk. In Proceedings of CHI’08, 453–456.

Laniado, D.; Castillo, C.; Kaltenbrunner, A.; and Fuster-Morell, M. 2012. Emotions and dialogue in a peer-production community: The case of Wikipedia. In Proceedings of WikiSym’12.

Malone, T.; Laubacher, R.; and Dellarocas, C. 2009. Harnessing crowds: Mapping the genome of collective intelligence. Research Paper No. 4732-09, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, USA.

Mason, W., and Watts, D. J. 2009. Financial incentives and the “performance of crowds”. In Proceedings of KDD workshop HCOMP’09.

Nov, O. 2007. What motivates Wikipedians? Communications of the ACM 50(11):60–64.

Passonneau, R. J., and Carpenter, B. 2013. The benefits of a model of annotation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, 187–195. Association for Computational Linguistics.

Raddick, M. J.; Bracey, G.; Gay, P. L.; Lintott, C. J.; Cardamone, C.; Murray, P.; Schawinski, K.; Szalay, A. S.; and Vandenberg, J. 2013. Galaxy Zoo: Motivations of citizen scientists. ArXiv e-prints.

Raykar, V. C.; Yu, S.; Zhao, L. H.; Valadez, G. H.; Florin, C.; Bogoni, L.; and Moy, L. 2010. Learning from crowds. Journal of Machine Learning Research 11:1297–1322.

Sidlauskas, B.; Bernard, C.; Bloom, D.; Bronaugh, W.; Clementson, M.; and Vari, R. P. 2011. Ichthyologists hooked on Facebook. Science 332(6029):537.

Snow, R.; O’Connor, B.; Jurafsky, D.; and Ng, A. Y. 2008. Cheap and fast – but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP’08.

von Ahn, L.; Maurer, B.; McMillen, C.; Abraham, D.; and Blum, M. 2008. reCAPTCHA: Human-based character recognition via web security measures. Science 321(5895):1465–1468.

von Ahn, L. 2006. Games with a purpose. Computer 39(6):92–94.

Woolley, A. W.; Chabris, C. F.; Pentland, A.; Hashmi, N.; and Malone, T. W. 2010. Evidence for a collective intelligence factor in the performance of human groups. Science 330:686–688.