Crowdsourcing Approaches to Big Data Curation for Earth Sciences

TRANSCRIPT

EarthBiAs2014, Global NEST, University of the Aegean
Crowdsourcing Approaches to Big Data Curation for Earth Sciences
Insight Centre for Data Analytics, National University of Ireland Galway
7-11 July 2014, Rhodes, Greece

Take Home
Algorithms + Humans turn Data into Better Data.

Talk Overview
Part I: Motivation
Part II: Data Quality and Data Curation
Part III: Crowdsourcing
Part IV: Case Studies on Crowdsourced Data Curation
Part V: Setting up a Crowdsourced Data Curation Process
Part VI: Linked Open Data Example
Part VII: Future Research Challenges

PART I: MOTIVATION

The BIG Project (BIG - Big Data Public Private Forum)
Overall objective: bring the necessary stakeholders into a self-sustainable, industry-led initiative that will greatly contribute to enhancing EU competitiveness by taking full advantage of Big Data technologies. Work at the technical, business and policy levels, shaping the future through the positioning of IIM and Big Data specifically in Horizon 2020.

Situating Big Data in Industry
Industry-driven sectorial forums (needs and offerings): Health; Public Sector; Finance & Insurance; Telco, Media & Entertainment; Manufacturing, Retail, Energy, Transport.
Technical working groups along the value chain:
- Data Acquisition: structured data, unstructured data, event processing, sensor networks, protocols, real-time, data streams, multimodality
- Data Analysis: stream mining, semantic analysis, machine learning, information extraction, Linked Data, data discovery, "whole world" semantics, ecosystems, community data analysis, cross-sectorial data analysis
- Data Curation: data quality, trust/provenance, annotation, data validation, human-data interaction, top-down/bottom-up, community/crowd, human computation, curation at scale, incentivisation, automation, interoperability
- Data Storage: in-memory DBs, NoSQL DBs, NewSQL DBs, cloud storage, query interfaces, scalability and performance, data models, consistency/availability/partition-tolerance, security and privacy, standardization
- Data Usage: decision support, prediction, in-use analytics, simulation, exploration, visualisation, modeling, control, domain-specific usage

Subject Matter Expert Interviews

Key Insights
Key trends:
- Lower usability barrier for data tools
- Blended human and algorithmic data processing for coping with data quality
- Leveraging large communities (crowds)
- Need for semantic, standardized data representation
- Significant increase in the use of new data models, e.g. graph (expressivity and flexibility)
- Much of (Big Data) technology evolution is evolutionary, but business process change must be revolutionary
- Data variety and verifiability are key opportunities; the long tail of data variety is a major shift in the data landscape
The data landscape (biggest blockers):
- Lack of business-driven Big Data strategies
- Need for format and data storage technology standards
- Data exchange between companies, institutions, individuals, etc.
- Regulations and markets for data access
- Human resources: lack of skilled data scientists
Technical white papers available at: http://www.big-project.eu

The Internet of Everything: Connecting the Unconnected

Earth Science: Systems of Systems

Citizen Sensors
Humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views.
Sheth, A. (2009). Citizen sensing, social signals, and enriching human experience. IEEE Internet Computing, 13(4), 87-92.

Air Pollution

Citizens as Sensors
Haklay, M. (2013). Citizen Science and Volunteered Geographic Information: overview and typology of participation. In Sui, D.Z., Elwood, S. and Goodchild, M.F. (eds.), Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice. Berlin: Springer.

PART II: DATA QUALITY AND DATA CURATION

The Problems with Data
Knowledge workers need access to the right data and confidence in that data.
Flawed data affects 25% of critical data in the world's top companies.
Data quality played a role in the recent financial crisis:
- Assets are defined differently in different programs
- Numbers did not always add up
- Departments do not trust each other's figures
- Figures not worth the pixels they were made of

What is Data Quality?
Desirable characteristics of an information resource, described as a series of quality dimensions:
- Discoverability & Accessibility: stored and classified in an appropriate and consistent manner
- Accuracy: correctly represents the real-world values it models
- Consistency: created and maintained using standardized definitions, calculations, terms, and identifiers
- Provenance & Reputation: track the source and determine its reputation, including the objectivity of the source/producer. Is the information unbiased, unprejudiced, and impartial, or does it come from a reputable but partisan source?
Wang, R. and Strong, D. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), 5-33.

Data Quality Example
[Figure: two product data sources. Source A (data developer side) and Source B hold records such as "APNR, iPod Nano, Red, 150" and "APNS, iPod Nano, Silver, 160" under fields ID, PNAME, PCOLOR, PRICE. Comparing them raises the questions: schema differences? value conflicts? entity duplication? Data stewards and business users view the same data from a less technical domain perspective.]
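The three problems in the example (schema differences, value conflicts, entity duplication) can be illustrated in a few lines of code. The sketch below is not from the tutorial; the records, field names and the 0.8 similarity threshold are invented for illustration.

```python
# Minimal sketch of the three data-quality problems from the example slide:
# schema differences, value conflicts, and entity duplication.
# The records and field names are invented for illustration.
from difflib import SequenceMatcher

source_a = [
    {"id": "APNR", "name": "iPod Nano", "color": "Red", "price": 150},
    {"id": "APNS", "name": "iPod Nano", "color": "Silver", "price": 160},
]
source_b = [
    {"id": "APNR", "product_name": "iPod Nano Red", "price": 155},
    {"id": "IPN890", "product_name": "iPod Nano", "price": 150},
]

# 1. Schema differences: fields used by one source but not the other.
fields_a = {f for r in source_a for f in r}
fields_b = {f for r in source_b for f in r}
print("Schema differences:", fields_a ^ fields_b)

# 2. Value conflicts: same identifier, different attribute values.
by_id_b = {r["id"]: r for r in source_b}
for rec in source_a:
    other = by_id_b.get(rec["id"])
    if other and rec["price"] != other["price"]:
        print("Value conflict for", rec["id"], ":", rec["price"], "vs", other["price"])

# 3. Entity duplication: different identifiers, very similar names.
def similar(x, y):
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

for rec in source_a:
    for other in source_b:
        if rec["id"] != other["id"] and similar(rec["name"], other["product_name"]) > 0.8:
            print("Possible duplicate:", rec["id"], "~", other["id"])
```

In practice each flagged pair would go to a data steward (or, later in this tutorial, a crowd) for confirmation rather than being resolved automatically.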
What is Data Curation?
- Digital curation: the selection, preservation, maintenance, collection, and archiving of digital assets
- Data curation: the active management of data over its life-cycle
- Data curators ensure data is trustworthy, discoverable, accessible, reusable, and fit for use; they are the museum cataloguers of the Internet age

Related Activities
- Data governance / master data management: the convergence of data quality, data management, business process management, and risk management
- Data curation is part of an organization's overall data governance strategy
- Data curator = data steward

Types of Data Curation
There are multiple approaches to curating data; there is no single correct way.
Who? Individual curators, curation departments, community-based curation.
How? Manual curation, (semi-)automated curation, sheer curation.

Types of Data Curation - Who?
- Individual data curators and curation departments: suited to infrequently changing, smaller quantities of data (up to around a million records)
- Availability: their post-hoc nature creates a delay in curated data availability

Types of Data Curation - Who?
- Community-based data curation: a decentralized approach that crowd-sources the curation process
- Leverages a community of users to curate data: the wisdom of the community (crowd)
- Can scale to millions of records

Types of Data Curation - How?
- Manual curation: curators directly manipulate the data; can tie users up with low-value-add activities
- (Semi-)automated curation: algorithms can (semi-)automate curation activities such as data cleansing, record de-duplication and classification; they can be supervised or approved by human curators

Types of Data Curation - How?
- Sheer curation, or curation at source: curation activities are integrated into the normal workflow of those creating and managing the data; this can be as simple as vetting or rating the results of a curation algorithm, and results can be available immediately
- Blended approaches (best of both): sheer curation plus a post-hoc curation department allows immediate access to curated data while ensuring quality control through expert curation

Data Quality / Data Curation Example
[Figure: a curation pipeline in which a data developer and a data curator profile sources, define mappings, cleanse, enrich, and de-duplicate product data and define rules under a data governance function, producing curated data for business users and applications.]

Data Curation
Pros:
- Can create a single version of the truth
- Standardized information creation and management
- Improves data quality
Cons:
- Significant upfront costs and effort
- Participation limited to a few (mostly technical) experts
- Difficult to scale to large data sources and the extended enterprise (e.g. partners, data vendors)
- Small percentage of data under management (e.g. CRM, product data)
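A minimal sketch of the profile, map, cleanse and de-duplicate steps from the curation pipeline example above. The function names, field mapping and cleansing rules are assumptions made for illustration, not part of the original material.

```python
# Illustrative sketch of a profile -> map -> cleanse -> de-duplicate pipeline.
# All rules, field names and sample records are assumed.

def profile(records):
    """Profile sources: count missing values per field."""
    missing = {}
    for rec in records:
        for field, value in rec.items():
            if value in (None, ""):
                missing[field] = missing.get(field, 0) + 1
    return missing

def map_schema(record, mapping):
    """Define mappings: rename source fields to the target schema."""
    return {mapping.get(k, k): v for k, v in record.items()}

def cleanse(record):
    """Cleanse: trim whitespace and normalise casing of string values."""
    return {k: v.strip().title() if isinstance(v, str) else v
            for k, v in record.items()}

def deduplicate(records, key="name"):
    """De-duplicate: keep the first record seen for each key value."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique

raw = [{"pname": " iPod nano ", "price": 150},
       {"pname": "IPOD NANO", "price": 150}]
mapped = [map_schema(r, {"pname": "name"}) for r in raw]
curated = deduplicate([cleanse(r) for r in mapped])
print(profile(raw), curated)
```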
The New York Times: 100 Years of Expert Data Curation
- Largest metropolitan and third-largest newspaper in the United States
- nytimes.com is the most popular newspaper website in the US
- A 100-year-old curated repository now defines its participation in the emerging Web of Data

The New York Times
Data curation at the NYT dates back to 1913, when publisher and owner Adolph S. Ochs decided to provide a set of additions to the newspaper.
- The New York Times Index: an organized catalog of article titles and summaries, containing the issue, date and column of each article, categorized by subject and names, and introduced on a quarterly and later annual basis
- The transitory content of the newspaper thus became an important source of searchable historical data, often used to settle historical debates

The New York Times
- The Index Department was created in 1913 for the curation and cataloguing of NYT resources; since 1851 the NYT had had only a low-quality index for internal use
- It developed a comprehensive catalog using a controlled vocabulary covering subjects, personal names, organizations, geographic locations and titles of creative works (books, movies, etc.), linked to articles and their summaries
- The current Index Department has around 15 people

The New York Times
- Challenges in consistently and accurately classifying news articles over time: keywords expressing subjects may vary due to cultural or legal constraints, and the identities of some entities, such as organizations and places, changed over time
- The controlled vocabulary grew to hundreds of thousands of categories, adding complexity to the classification process

The New York Times
- The increased importance of the Web drove the need to improve categorization of online content
- Curation carried out by the Index Department runs on "library time" (days to weeks); the print edition can live with a next-day index, but this is not suitable for real-time online publishing: nytimes.com needed a same-day index

The New York Times
- A two-stage curation process was introduced: editorial staff (several hundred journalists) perform best-effort, semi-automated sheer curation at the point of online publication, and the Index Department follows up with long-term, accurate classification and archiving
- Benefits: non-expert journalist curators provide instant accessibility to online users, while the Index Department provides long-term, high-quality curation in a "trust but verify" approach

NYT Curation Workflow
1. Curation starts when an article leaves the newsroom.
2. A member of the editorial staff submits the article to a web-based, rule-based information extraction system (SAS Teragram).
3. Teragram applies linguistic extraction rules based on a subset of the Index Department's controlled vocabulary.
4. Teragram suggests tags from the Index vocabulary that can potentially describe the content of the article.
5. The editorial staff member selects the terms that best describe the contents and inserts new tags if necessary.
6. The tags are reviewed by the taxonomy managers, with feedback to the editorial staff on the classification process.
7. The article is published online at nytimes.com.
8. At a later stage the article receives second-level curation by the Index Department, which adds additional Index tags and a summary.
9. The article is submitted to the NYT Index.
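SAS Teragram itself is a proprietary system, but the "suggest tags from the controlled vocabulary, then let a journalist confirm them" step can be approximated with simple keyword rules. The vocabulary, patterns and article text below are invented; this is only a rough illustration of the idea, not the NYT implementation.

```python
# Rough approximation of rule-based tag suggestion against a controlled
# vocabulary, followed by human confirmation. Vocabulary and rules invented.
import re

controlled_vocabulary = {
    "Air Pollution": [r"\bair quality\b", r"\bsmog\b", r"\bpollut\w*"],
    "Climate Change": [r"\bclimate\b", r"\bglobal warming\b"],
    "Galway (Ireland)": [r"\bgalway\b"],
}

def suggest_tags(article_text):
    """Return vocabulary terms whose extraction rules match the article."""
    suggestions = []
    for term, patterns in controlled_vocabulary.items():
        if any(re.search(p, article_text, re.IGNORECASE) for p in patterns):
            suggestions.append(term)
    return suggestions

article = "Smog levels in Galway prompted new air quality monitoring."
suggested = suggest_tags(article)

# Sheer curation step: the editorial staff member accepts or rejects each
# suggestion (simulated here with a hard-coded choice).
accepted = {"Air Pollution", "Galway (Ireland)"}
confirmed = [t for t in suggested if t in accepted]
print("Suggested:", suggested)
print("Tags sent for publication:", confirmed)
```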
The New York Times
Early adopter of Linked Open Data (June 2009).

The New York Times: Linked Open Data @ data.nytimes.com
- A subset of 10,000 tags from the index vocabulary: a dataset of people, organizations and locations
- Complemented by search services to consume data about articles, movies, best sellers, Congress votes, real estate, and more
- Benefits: improves traffic through third-party data usage and lowers the development cost of new applications for different verticals inside the website (e.g. movies, travel, sports, books)

PART III: CROWDSOURCING

Crowdsourcing Landscape

Introduction to Crowdsourcing
- Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems that computers or a single user can't
- A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals
- Related areas: collective intelligence, social computing, human computation, data mining
Quinn, A. J. and Bederson, B. B. (2011). Human computation: a survey and taxonomy of a growing field. In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, pp. 1403-1412.

When Computers Were Human
- Maskelyne (1760) used human computers to create an almanac of moon positions, used for shipping and navigation
- Quality assurance: do the calculations twice and compare against a third verifier
Grier, D. A. (2005). When Computers Were Human. Princeton University Press.

20th Century
- 1926, Teleautomation: "when wireless is perfectly applied the whole earth will be converted into a huge brain."
- 1948, Cybernetics: communication and control theory concerned especially with the comparative study of automatic control systems.
- 1961, Embedded systems: a system with a dedicated function within a larger mechanical or electrical system, often with real-time computing constraints.
- 1988, Ubiquitous computing: an advanced computing concept where computing is made to appear everywhere and anywhere.
(Credits: Thierry Ehrmann (Flickr), Dr. Sabina Jeschke, Wikimedia Foundation)

21st Century
- 1999, Internet of Things: uniquely identifiable objects and their virtual representations in an Internet-like structure.
- 2006, Cyber-physical systems: engineered systems that integrate computation with physical processes.
- 2008, Web of Things: a set of blueprints to make everyday physical objects first-class citizens of the World Wide Web by giving them an API.
- 2012, Physical-Cyber-Social computing: a holistic treatment of data, information, and knowledge from the PCS worlds to integrate, correlate, interpret, and provide contextually relevant abstractions to humans.
(Credits: Kevin Ashton, Amit Sheth, Helen Gill, Wikimedia Foundation)
Human-Powered Cyber-Physical Systems
Sensing, computation, actuation: leverages human capabilities in conjunction with machine capabilities to optimize processes in cyber-physical-social environments.
(Credits: Albany Associates, stuartpilrow, Mike_n (Flickr))

Human vs Machine Affordances
- Human: visual perception, visuospatial thinking, audiolinguistic ability, sociocultural awareness, creativity, domain knowledge
- Machine: large-scale data manipulation, collecting and storing large amounts of data, efficient data movement, bias-free analysis
Crouser, R. J. and Chang, R. (2012). An affordance-based framework for human computation and human-computer collaboration. IEEE Transactions on Visualization and Computer Graphics, 18, 2859-2868.

When to Crowdsource a Task?
- Computers cannot do the task
- A single person cannot do the task
- The work can be split into smaller tasks

Platforms and Marketplaces

Types of Crowds
- Internal corporate communities: tap the potential of the internal workforce; curate competitive enterprise data that will remain internal to the company (though not always, e.g. product technical support and marketing data)
- External communities: public crowdsourcing marketplaces and pre-competitive communities

Generic Architecture
[Figure: requestors publish tasks to a platform/marketplace, which manages the tasks and distributes them to workers; completed work flows back to the requestors.]

Mechanical Turk (MTurk) Workflow

Spatial Crowdsourcing
- Crowdsourcing that requires a person to travel to a location to perform a spatial task
- Helps non-local requesters through workers in a targeted spatial locality
- Used for data collection, package routing, citizen actuation
- Usually based on mobile applications; closely related to social sensing, participatory sensing, etc.
- Early example: the Aardvark social search engine

PART IV: CASE STUDIES ON CROWDSOURCED DATA CURATION

Crowdsourced Data Curation
[Figure: data-quality rules and algorithms (entity linking, data fusion, relation extraction) are combined with human computation (relevance judgment, data verification, disambiguation) to turn the Web of Data, databases and textual content into clean data. An internal community of programmers and managers contributes domain knowledge and high-quality, trustable responses; an external crowd contributes high availability, large scale and a variety of expertise. See the hybrid sketch below.]
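One way to read the figure is as a hybrid pipeline: an algorithm scores candidate decisions and only the uncertain ones become human-computation tasks. A minimal sketch of that split follows; the candidate pairs, similarity measure and thresholds are assumptions for illustration.

```python
# Minimal sketch of blending algorithms with human computation:
# confident matches are handled automatically, uncertain ones are
# turned into verification micro-tasks. Scores and thresholds invented.
from difflib import SequenceMatcher

candidate_pairs = [
    ("Natl. Univ. of Ireland Galway", "National University of Ireland Galway"),
    ("Rhodes, Greece", "Rhodos"),
    ("EPA Air Quality Index", "Exchange rate index"),
]

def score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

ACCEPT, REJECT = 0.85, 0.35   # assumed confidence thresholds

auto_accepted, crowd_tasks, auto_rejected = [], [], []
for a, b in candidate_pairs:
    s = score(a, b)
    if s >= ACCEPT:
        auto_accepted.append((a, b))
    elif s <= REJECT:
        auto_rejected.append((a, b))
    else:
        # Route the uncertain pair to human workers as a yes/no micro-task.
        crowd_tasks.append({"question": f"Do '{a}' and '{b}' refer to the same thing?"})

print(len(auto_accepted), "accepted,", len(auto_rejected), "rejected,",
      len(crowd_tasks), "sent to the crowd")
```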
Examples of Crowdsourced Data Management Tasks
- Understanding customer sentiment for the launch of a new product around the world: a 24/7 sentiment analysis system was implemented with workers from around the world, achieving 90% accuracy on 95% of the content
- Categorizing millions of products in eBay's catalog with accurate and complete attributes: combining the crowd with machine learning creates an affordable and flexible catalog quality system

Examples of Crowdsourced Data Management Tasks
- Natural language processing: dialect identification, spelling correction, machine translation, word similarity
- Computer vision: image similarity, image annotation/analysis
- Classification: data attributes, improving taxonomies, search results
- Verification: entity consolidation, de-duplication, cross-checking, validating data
- Enrichment: judgments, annotation

Wikipedia
- Collaboratively built by a large community: more than 19,000,000 articles in 270+ languages, with 3,200,000+ articles in English and more than 157,000 active contributors
- Accuracy and stylistic formality are equivalent to expert-based resources such as the Columbia and Britannica encyclopedias
- MediaWiki, the software behind Wikipedia, is widely used inside organizations: Intellipedia (16 U.S. intelligence agencies), WikiProteins (curated protein data for knowledge discovery)

Wikipedia
- Its decentralized environment supports the creation of high-quality information through social organization and through artifacts, tools and processes for coordinating cooperative work
- Wikipedia's collaboration dynamics highlight good practices

Wikipedia Social Organization
- Any user can edit its contents, without prior registration; this does not lead to a chaotic scenario and is in practice a highly scalable approach to high-quality content creation on the Web
- It relies on a simple but highly effective way of coordinating its curation process
- Curation is the activity of Wikipedia admins, who are responsible for information quality standards

Wikipedia Social Organization
- Incentives: improvement of one's reputation, a sense of efficacy, contributing effectively to a meaningful project
- Over time the focus of editors typically changes, from curating a few articles on specific topics to a more global curation perspective, enforcing quality assessment of Wikipedia as a whole

Wikipedia Artifacts, Tools & Processes
- Wiki article editor (tool): WYSIWYG or markup text editor
- Talk pages (tool): a public arena for discussions around Wikipedia resources
- Watchlists (tool): help curators actively monitor the integrity and quality of resources they contribute to
- Permission mechanisms (tool): users with administrator status can perform critical actions such as removing pages and granting administrative permissions to new users
- Automated editing (tool): bots are automated or semi-automated tools that perform repetitive tasks over content
- Page history and restore (tool): a historical trail of changes to a Wikipedia resource
- Guidelines, policies & templates (artifact): define curation guidelines for editors to assess article quality
- Dispute resolution (process): a dispute mechanism between editors over article contents
- Article edition, deletion, merging, redirection, transwiking, archival (processes): describe the curation actions over Wikipedia resources
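The page histories that watchlists and bots build on are exposed through the public MediaWiki API. Below is a small sketch of pulling the recent revisions of an article; the article title and the printed fields are arbitrary example choices.

```python
# Fetch recent revisions of a Wikipedia article via the public MediaWiki API,
# the same data that powers watchlists and page-history tools.
import requests  # pip install requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Air pollution",          # example article
    "rvprop": "timestamp|user|comment",
    "rvlimit": 5,
    "format": "json",
}
resp = requests.get(API, params=params, timeout=10)
pages = resp.json()["query"]["pages"]
for page in pages.values():
    for rev in page.get("revisions", []):
        print(rev["timestamp"], rev["user"], "-", rev.get("comment", ""))
```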
DBpedia Knowledge Base
- DBpedia provides direct access to data, indirectly using the wiki as a data curation platform
- It inherits a massive volume of curated Wikipedia data: 3.4 million entities and 1 billion RDF triples
- A comprehensive data infrastructure: concept URIs, definitions, basic types

Wikipedia - DBpedia

Collaborative Knowledge Base
- A collaborative knowledge base maintained by a community of web users
- Users create entity types and their metadata according to guidelines
- Schema changes by end users require administrative approval

Audio Tagging - TagATune

Image Tagging - Peekaboom

Protein Folding - Foldit (fold.it)

ReCAPTCHA
- OCR has an error rate of roughly 1%, rising to 20%-30% for 18th- and 19th-century books
- 40 million reCAPTCHAs were solved every day (2008), fixing 40,000 books a day

ChemSpider
- A structure-centric chemical community: over 300 data sources with 25 million records, provided by chemical vendors, government databases, private laboratories and individuals
- Pharma is realizing the benefits of open data: heavily leveraged by pharmaceutical companies as a pre-competitive resource for experimental and clinical-trial investigation; GlaxoSmithKline made its proprietary malaria dataset of 13,500 compounds available

Macromolecular Structure Data
- Dedicated to improving the understanding of biological systems' functions through the 3-D structure of macromolecules
- Started in 1971 with 3 core members and 7 crystal structures; grown to 63,000 structures and over 300 million dataset downloads
- Expanded beyond a curated data download service to include complex molecular visualization, search, and analysis capabilities

PART V: SETTING UP A CROWDSOURCED DATA CURATION PROCESS

Core Design Questions
Goal (what), incentives (why), workers (who), process (how).
Malone, T. W., Laubacher, R. and Dellarocas, C. N. (2009). Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09.

Setting up a Curation Process
1. Who is doing it?
2. Why are they doing it?
3. What is being done?
4. How is it being done?

1) Who is doing it? (Workers)
- Hierarchy (assignment): someone in authority assigns a particular person or group of people to perform the task, either within the enterprise (individuals, specialised departments) or within a structured community (e.g. a pre-competitive community)
- Crowd (choice): anyone in a large group who chooses to do so, in internal or external crowds

2) Why are they doing it? (Incentives)
- Motivations: money, glory (reputation/prestige), love (altruism, socializing, enjoyment), an unintended by-product (e.g. reCAPTCHA, captured in a workflow), self-serving resources (e.g. Wikipedia, product/customer data), or part of the job description (e.g. data curation as part of a role)
- Determine the pay and time for each task; in a marketplace this is a delicate balance: money does not improve quality but can increase participation
- In an internal hierarchy, engineer opportunities for recognition: performance reviews, prizes for top contributors, badges, leaderboards, etc.
Effect of Payment on Quality
- Cost does not affect quality; similar results hold for bigger tasks [Ariely et al., 2009]
Mason, W. A. and Watts, D. J. (2009). Financial incentives and the performance of crowds. In Proceedings of the Human Computation Workshop. Paris: ACM.
[Panos Ipeirotis, WWW2011 tutorial]

3) What is being done? (Goal)
3.1 Identify the data
- Newly created data and/or legacy data?
- How is new data created? Do users create the data, or is it imported from an external source?
- How frequently is new data created or updated? What quantity of data is created?
- How much legacy data exists? Is it stored in a single source, or scattered across multiple sources?

3) What is being done? (Goal)
3.2 Identify the tasks
- Creation tasks: create/generate, find, improve/edit/fix
- Decision (vote) tasks: accept/reject, thumbs up/thumbs down, vote for best

4) How is it being done?
- Identify the workflow: tasks integrated into the normal workflow of those creating and managing data, which can be as simple as vetting or rating the results of an algorithm
- Identify the platform: internal/community collaboration platforms or a public crowdsourcing platform; consider the availability of appropriate workers (i.e. experts)
- Identify the algorithm: data quality, image recognition, etc.

4) How is it being done?
Task design: task interface, task assignment/routing, task quality assurance.

Task Design
[Figure: input passes through a task router before computation, a task interface during computation, and output aggregation after computation.]
Law, E. and von Ahn, L. Human Computation: Core Research Questions and State of the Art.

Pull Routing
- Workers seek out tasks and assign them to themselves
- Search and discovery of tasks is supported by the platform, task recommendation and peer routing
[Figure: workers select tasks through a search-and-browse interface; results are returned to the requesting algorithm.]

Push Routing
- The system assigns tasks to workers based on past performance, expertise, cost and latency
[Figure: an algorithm assigns tasks to workers through the task interface; results flow back. Example: www.mobileworks.com]

Managing Task Quality Assurance
- Redundancy (quorum votes): replicate the task (e.g. 3 times) and use majority voting to determine the right value (% agreement), or a weighted majority vote
- Gold data / honey pots: inject trap questions to test quality and to check for worker fatigue (the habit of saying "no" all the time)
- Estimation of worker quality: redundancy plus gold data
- Qualification tests: use test tasks to determine a user's ability for such tasks
A minimal sketch of these checks follows below.
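The sketch covers two of the checks just listed: redundancy with majority voting, plus a gold (honey-pot) question used to estimate worker quality. All worker answers are fabricated for illustration.

```python
# Redundancy + majority vote, plus a gold question to estimate worker quality.
# All worker answers below are fabricated for illustration.
from collections import Counter

# Each task was replicated to three workers (quorum of 3).
answers = {
    "task_1": {"alice": "yes", "bob": "yes", "carol": "no"},
    "task_2": {"alice": "no",  "bob": "no",  "carol": "no"},
    "gold_1": {"alice": "yes", "bob": "no",  "carol": "yes"},  # trap question
}
gold = {"gold_1": "yes"}  # known correct answer

def majority(votes):
    """Return the majority label and the level of agreement."""
    label, count = Counter(votes.values()).most_common(1)[0]
    return label, count / len(votes)

for task, votes in answers.items():
    if task not in gold:
        print(task, "->", majority(votes))

# Worker quality estimated from the gold (trap) questions.
for worker in ["alice", "bob", "carol"]:
    correct = sum(answers[t][worker] == a for t, a in gold.items())
    print(worker, "gold accuracy:", correct / len(gold))
```

A weighted majority vote would replace the simple count with per-worker weights derived from this gold accuracy.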
Social Best Practices
- Participation: stakeholder involvement of data producers and consumers must occur early in the project; it provides insight into the basic questions of what they want to do, for whom, and what it will provide
- Incentives: sheer curation needs a line of sight from the data curation activity to tangible exploitation benefits; recognize contributing curators through a formal feedback mechanism
- Engagement: outreach is essential for promotion and feedback; typical consumer-to-contributor ratios are below 5%

Technical Best Practices
- Human and automated curation: automated curation should always defer to, and never override, human curation edits; automate the validation of data deposition and entry; target the community at focused curation tasks
- Track provenance: curation activities should be recorded, especially where human curators are involved; perspectives of provenance differ - a scientist may need to evaluate the fine-grained experiment description behind the data, while for a business analyst the brand of the data provider can be sufficient for determining quality

PART VI: LINKED OPEN DATA EXAMPLE

Linked Open Data (LOD)
- Expose and interlink datasets on the Web
- Use URIs to identify things in your data
- Use a graph representation (RDF) to describe the URIs
- Vision: the Web as a huge graph database
(Linking Open Data cloud diagram by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/)

Linked Data Example
[Figure: multiple identifiers for the same real-world entity, connected by identity resolution links.]

Identity Resolution in LOD
- Quality issues with Linked Data: imprecise, outdated or wrong
- Uncertainty of identity resolution links, due to multiple interpretations of identity equivalence and to the characteristics of (similarity-based) link generation algorithms
- User feedback for uncertain links: verify uncertain identity resolution links with users/experts to improve the quality of entity consolidation

Identity Resolution in LOD
[Figure: multiple identifiers for the Galway entity in the Linked Open Data cloud, connected by owl:sameAs links from different publishers and consumers, i.e. different sources of identity resolution links. Linking Open Data cloud diagram by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/]

LOD Application Architecture
[Figure: a consolidation module produces candidate links and matching dependencies; a utility module ranks feedback tasks; a feedback module poses questions to users and applies feedback rules, feeding data improvement back into consolidation.]
Heath, T. and Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition), 1-136. Morgan & Claypool.
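The identity-resolution example can be explored directly against the public DBpedia SPARQL endpoint: the query below lists owl:sameAs links for the Galway resource, the kind of candidate links a curation process might put to users for verification. It assumes the SPARQLWrapper package is installed; the endpoint and resource URI are the public DBpedia ones.

```python
# List owl:sameAs identity links for the Galway entity in DBpedia,
# candidates that a curation process might ask users to verify.
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?other WHERE {
      <http://dbpedia.org/resource/Galway> <http://www.w3.org/2002/07/owl#sameAs> ?other .
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["other"]["value"])
```

In a curation workflow, links produced with low confidence by the link-generation algorithm would be queued as verification questions for users rather than simply printed.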
PART VII: FUTURE RESEARCH CHALLENGES

Future Research Directions
- Incentives and social engagement: better recognition of the data curation role; understanding of social engagement mechanisms
- Economic models: pre-competitive and public-private partnerships
- Curation at scale: evolution of human computation and crowdsourcing, instrumenting popular apps for data curation, general-purpose data curation pipelines, human-data interaction
- Trust: capture of data curation decisions and provenance management; fine-grained permission management models and tools
- Data curation models: nanopublications; theoretical principles and domain-specific models

Future Research Directions
- Spatial crowdsourcing: matching tasks with workers at the right time and location, balancing workload among workers, tasks at remote locations, chaining tasks in the same vicinity, preserving worker privacy
- Interoperability: finding the semantic similarity of tasks across systems, defining and measuring worker capability across heterogeneous systems, enabling routing middleware for multiple systems, compatibility of reputation systems, defining standards for task exchange

Heterogeneous Crowds
- Multiple requesters, tasks, workers and platforms collaborating on data curation across cyber-physical-social systems

SLUA Ontology
[Figure: a User possesses Capabilities and performs Actions; a Task requires Capabilities and includes Actions; a Task offers Rewards that Users earn. Subclasses of Capability: Location, Skill, Knowledge, Ability, Availability, Reputation. Subclasses of Reward: Money, Fun, Altruism, Learning.]
Ul Hassan, U., O'Riain, S. and Curry, E. (2013). SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing. In International Workshop on Crowdsourcing the Semantic Web.

Future Research Directions
- Task routing: optimizing task completion, quality and latency; inferring worker preferences, skills and knowledge; balancing the exploration-exploitation trade-off between inference and optimization; the cold-start problem for new workers or tasks; ensuring worker satisfaction via load balancing and rewards (see the capability-matching sketch below)
- Human-computer interaction: reducing search friction through good browsing interfaces; presenting the requisite information and nothing more; choosing the level of task granularity for complex tasks; ensuring worker engagement; designing games with a purpose to crowdsource with fun
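Following on from the task-routing challenges and the SLUA reading above (tasks require capabilities, users possess them, tasks offer rewards), a capability-coverage score is one simple way to route tasks. The sketch below is an illustration under that reading only; the workers, task, and the location bonus are invented.

```python
# Capability-based task routing in the spirit of the SLUA model:
# a task requires capabilities, users possess capabilities, and the
# router picks the best-covered worker. All data here is invented.

workers = {
    "w1": {"skills": {"geotagging", "english"}, "location": "Rhodes"},
    "w2": {"skills": {"chemistry", "english"},  "location": "Galway"},
}

task = {
    "requires": {"geotagging", "english"},
    "location": "Rhodes",          # spatial tasks prefer nearby workers
    "offers": "money",
}

def match_score(worker, task):
    """Fraction of required capabilities the worker possesses,
    with a small bonus if the worker is at the task location."""
    coverage = len(task["requires"] & worker["skills"]) / len(task["requires"])
    bonus = 0.1 if worker["location"] == task["location"] else 0.0
    return coverage + bonus

best = max(workers, key=lambda w: match_score(workers[w], task))
print("Route task to:", best, "score:", match_score(workers[best], task))
```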
Summary
Algorithms + Humans turn Data into Better Data.

Selected References: Big Data & Data Quality
- Lavalle, S., Lesser, E., Shockley, R., Hopkins, M. S. and Kruschwitz, N. (2011). Big Data, Analytics and the Path from Insights to Value. MIT Sloan Management Review, 52(2), 21-32.
- Haug, A. and Arlbjorn, J. S. (2011). Barriers to master data quality. Journal of Enterprise Information Management, 24(3), 288-303.
- Silvola, R., Jaaskelainen, O., Kropsu-Vehkapera, H. and Haapasalo, H. (2011). Managing one master data - challenges and preconditions. Industrial Management & Data Systems, 111(1), 146-162.
- Curry, E., Hasan, S. and O'Riain, S. (2012). Enterprise Energy Management using a Linked Dataspace for Energy Intelligence. In Second IFIP Conference on Sustainable Internet and ICT for Sustainability.
- Loshin, D. (2008). Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
- Pipino, L. L., Lee, Y. W. and Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45(4), 211-218.
- Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys, 41(3), 16.
- Otto, B. and Reichert, A. (2010). Organizing Master Data Management: Findings from an Expert Survey. In Proceedings of the 2010 ACM Symposium on Applied Computing (SAC '10), pp. 106-110.
- Wang, R. and Strong, D. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), 5-33.
- Ul Hassan, U., O'Riain, S. and Curry, E. (2012). Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications. In 9th International Workshop on Information Integration on the Web (IIWeb 2012), Scottsdale, Arizona: ACM.

Selected References: Collective Intelligence, Crowdsourcing & Human Computation
- Malone, T. W., Laubacher, R. and Dellarocas, C. (2009). Harnessing Crowds: Mapping the Genome of Collective Intelligence.
- Doan, A., Ramakrishnan, R. and Halevy, A. Y. (2011). Crowdsourcing systems on the World-Wide Web. Communications of the ACM, 54(4), 86.
- Quinn, A. J. and Bederson, B. B. (2011). Human computation: a survey and taxonomy of a growing field. In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, pp. 1403-1412.
- Mason, W. A. and Watts, D. J. (2009). Financial incentives and the performance of crowds. In Proceedings of the Human Computation Workshop. Paris: ACM.
- Law, E. and von Ahn, L. (2011). Human Computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(3), 1-121.
- Franklin, M. J., Kossmann, D., Kraska, T., Ramesh, S. and Xin, R. (2011). CrowdDB: Answering Queries with Crowdsourcing. In Proceedings of the 2011 International Conference on Management of Data (SIGMOD '11), p. 61.
- Wichmann, P., Borek, A., Kern, R., Woodall, P., Parlikad, A. K. and Satzger, G. (2011). Exploring the Crowd as Enabler of Better Information Quality. In Proceedings of the 16th International Conference on Information Quality, pp. 302-312.
- Mason, W. A. and Watts, D. J. (2009). Financial incentives and the "performance of crowds". SIGKDD Explorations, 11(2), 100-108.
- Ipeirotis, P. (2011). Managing Crowdsourced Human Computation. WWW2011 tutorial.
- Alonso, O. and Lease, M. (2011). Crowdsourcing 101: Putting the WSDM of Crowds to Work for You. WSDM, Hong Kong.
- Grier, D. A. (2005). When Computers Were Human. Princeton University Press. http://www.youtube.com/watch?v=YwqltwvPnkw
- Ul Hassan, U. and Curry, E. (2013). A capability requirements approach for predicting worker performance in crowdsourcing. In 9th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), pp. 429-437. IEEE.

Selected References: Collaborative Data Management
- Curry, E., Freitas, A. and O'Riain, S. (2010). The Role of Community-Driven Data Curation for Enterprises. In Wood, D. (ed.), Linking Enterprise Data. Boston, MA: Springer US, pp. 25-47.
- Ul Hassan, U., O'Riain, S. and Curry, E. (2012). Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers. In 17th International Conference on Information Quality (ICIQ 2012), Paris, France.
- Ul Hassan, U., O'Riain, S. and Curry, E. (2013). Effects of Expertise Assessment on the Quality of Task Routing in Human Computation. In 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France.
- Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., ... and Xu, S. (2013). Data Curation at Scale: The Data Tamer System. In CIDR.
- Parameswaran, A. G., Park, H., Garcia-Molina, H., Polyzotis, N. and Widom, J. (2012). Deco: declarative crowdsourcing. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 1203-1212. ACM.
- Parameswaran, A., Boyd, S., Garcia-Molina, H., Gupta, A., Polyzotis, N. and Widom, J. (2014). Optimal crowd-powered rating and filtering algorithms. Proceedings of the VLDB Endowment.
- Marcus, A., Wu, E., Karger, D., Madden, S. and Miller, R. (2011). Human-powered sorts and joins. Proceedings of the VLDB Endowment, 5(1), 13-24.
- Guo, S., Parameswaran, A. and Garcia-Molina, H. (2012). So who won?: dynamic max discovery with the crowd. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 385-396. ACM.
- Davidson, S. B., Khanna, S., Milo, T. and Roy, S. (2013). Using the crowd for top-k and group-by queries. In Proceedings of the 16th International Conference on Database Theory, pp. 225-236. ACM.
- Chai, X., Vuong, B. Q., Doan, A. and Naughton, J. F. (2009). Efficiently incorporating user feedback into information extraction and integration programs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 87-100. ACM.

Selected References: Spatial Crowdsourcing
- Kazemi, L. and Shahabi, C. (2012). GeoCrowd: enabling query answering with spatial crowdsourcing. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pp. 189-198. ACM.
- Benouaret, K., Valliyur-Ramalingam, R. and Charoy, F. (2013). CrowdSC: Building Smart Cities with Large-Scale Citizen Participation. IEEE Internet Computing.
- Musthag, M. and Ganesan, D. (2013). Labor dynamics in a mobile micro-task market. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 641-650. ACM.
- Deng, D., Shahabi, C. and Demiryurek, U. (2013). Maximizing the number of worker's self-selected tasks in spatial crowdsourcing. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM.
- To, H., Ghinita, G. and Shahabi, C. (2014). A Framework for Protecting Worker Location Privacy in Spatial Crowdsourcing. Proceedings of the VLDB Endowment, 7(10).
- Goncalves, J., Ferreira, D., Hosio, S., Liu, Y., Rogstadius, J., Kukka, H. and Kostakos, V. (2013). Crowdsourcing on the spot: altruistic use of public displays, feasibility, performance, and behaviours. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 753-762. ACM.
- Cardone, G., Foschini, L., Bellavista, P., Corradi, A., Borcea, C., Talasila, M. and Curtmola, R. (2013). Fostering participaction in smart cities: a geo-social crowdsensing platform. IEEE Communications Magazine, 51(6).

Books
- Surowiecki, J. (2005). The Wisdom of Crowds. Random House.
- Batini, C. and Scannapieco, M. (2006). Data Quality: Concepts, Methodologies and Techniques. Springer.
- Michelucci, P. (2013). Handbook of Human Computation. Springer.
- Law, E. and von Ahn, L. (2011). Human Computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(3), 1-121.
- Heath, T. and Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1), 1-136.
- Grier, D. A. (2013). When Computers Were Human. Princeton University Press.
- Easley, D. and Kleinberg, J. Networks, Crowds, and Markets. Cambridge University Press.
- Sheth, A. and Thirunarayan, K. (2012). Semantics Empowered Web 3.0: Managing Enterprise, Social, Sensor, and Cloud-based Data and Services for Advanced Applications. Synthesis Lectures on Data Management, 4(6), 1-175.

Tutorials
- Human Computation and Crowdsourcing: http://research.microsoft.com/apps/video/default.aspx?id=169834 and http://www.youtube.com/watch?v=tx082gDwGcM
- Human-Powered Data Management: http://research.microsoft.com/apps/video/default.aspx?id=185336
- Crowdsourcing Applications and Platforms: A Data Management Perspective: http://www.vldb.org/pvldb/vol4/p1508-doan-tutorial4.pdf
- Human Computation: Core Research Questions and State of the Art: http://www.humancomputation.com/Tutorial.html
- Crowdsourcing & Machine Learning: http://www.cs.rutgers.edu/~hirsh/icml-2011-tutorial/
- Data quality and data cleaning: an overview: http://dl.acm.org/citation.cfm?id=872875

Datasets
- TREC Crowdsourcing Track: https://sites.google.com/site/treccrowd/
- 2010 Crowdsourced Web Relevance Judgments Data: https://docs.google.com/document/d/1J9H7UIqTGzTO3mArkOYaTaQPibqOTYb_LwpCpu2qFCU/edit
- Statistical QUality Assurance Robustness Evaluation Data: http://ir.ischool.utexas.edu/square/data.html
- Crowdsourcing at Scale 2013: http://www.crowdscale.org/
- USEWOD - Usage Analysis and the Web of Data: http://usewod.org/usewodorg-2.html
- NAACL 2010 Workshop: https://sites.google.com/site/amtworkshop2010/data-1
- mturk-tracker.com
- GalaxyZoo.com
- CrowdCrafting.com

Credits
Special thanks to Umair ul Hassan for his assistance with the tutorial.