Crowdsourcing Approaches to Big Data Curation for Earth Sciences


Post on 11-Aug-2014


TRANSCRIPT

EarthBiAs2014, 7-11 July 2014, Rhodes, Greece
Global NEST, University of the Aegean

Crowdsourcing Approaches to Big Data Curation for Earth Sciences
Insight Centre for Data Analytics, National University of Ireland Galway

Take Home
Algorithms, Humans, Better Data, Data

Talk Overview
- Part I: Motivation
- Part II: Data Quality and Data Curation
- Part III: Crowdsourcing
- Part IV: Case Studies on Crowdsourced Data Curation
- Part V: Setting up a Crowdsourced Data Curation Process
- Part VI: Linked Open Data Example
- Part VII: Future Research Challenges

PART I: MOTIVATION

BIG: Big Data Public Private Forum

THE BIG PROJECT
Overall objective:
- Bringing the necessary stakeholders into a self-sustainable, industry-led initiative, which will greatly contribute to enhancing EU competitiveness by taking full advantage of Big Data technologies.
- Work at technical, business and policy levels, shaping the future through the positioning of IIM and Big Data specifically in Horizon 2020.

SITUATING BIG DATA IN INDUSTRY
Sectors: Health; Public Sector; Finance & Insurance; Telco, Media & Entertainment; Manufacturing, Retail, Energy, Transport
Diagram labels: Needs, Offerings, Value Chain, Technical Working Groups, Industry Driven Sectorial Forums

Value chain stages and their topics:
- Data Acquisition: Structured data, Unstructured data, Event processing, Sensor networks, Protocols, Real-time, Data streams, Multimodality
- Data Analysis: Stream mining, Semantic analysis, Machine learning, Information extraction, Linked Data, Data discovery, Whole world semantics, Ecosystems, Community data analysis, Cross-sectorial data analysis
- Data Curation: Data Quality, Trust / Provenance, Annotation, Data validation, Human-Data Interaction, Top-down/Bottom-up, Community / Crowd, Human Computation, Curation at scale, Incentivisation, Automation, Interoperability
- Data Storage: In-Memory DBs, NoSQL DBs, NewSQL DBs, Cloud storage, Query Interfaces, Scalability and Performance, Data Models, Consistency, Availability, Partition-tolerance, Security and Privacy, Standardization
- Data Usage: Decision support, Prediction, In-use analytics, Simulation, Exploration, Visualisation, Modeling, Control, Domain-specific usage

SUBJECT MATTER EXPERT INTERVIEWS

KEY INSIGHTS
Key Trends:
- Lower usability barrier for data tools
- Blended human and algorithmic data processing for coping with data quality
- Leveraging large communities (crowds)
- Need for semantic standardized data representation
- Significant increase in use of new data models (i.e. graph) (expressivity and flexibility)
The Data Landscape:
- Much of (Big Data) technology is evolving in an evolutionary way
- But business process change must be revolutionary
- Data variety and verifiability are key opportunities
- Long tail of data variety is a major shift in the data landscape
Biggest Blockers:
- Lack of business-driven Big Data strategies
- Need for format and data storage technology standards
- Data exchange between companies, institutions, individuals, etc.
- Regulations & markets for data access
- Human resources: lack of skilled data scientists
Technical White Papers available on: http://www.big-project.eu

The Internet of Everything: Connecting the Unconnected

Earth Science Systems of Systems

Citizen Sensors
"humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views"
Sheth, A. (2009). Citizen sensing, social signals, and enriching human experience. Internet Computing, IEEE, 13(4), 87-92.

Air Pollution

Citizens as Sensors

Haklay, M., 2013, Citizen Science and Volunteered Geographic Information: overview and typology of participation, in Sui, D.Z., Elwood, S. and M.F. Goodchild (eds.), 2013. Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice. Berlin: Springer.

PART II: DATA QUALITY AND DATA CURATION

The Problems with Data
Knowledge workers need:
- Access to the right data
- Confidence in that data
Flawed data affects 25% of critical data in the world's top companies
Data quality role in the recent financial crisis:
- Assets are defined differently in different programs
- Numbers did not always add up
- Departments do not trust each other's figures
- Figures "not worth the pixels they were made of"

What is Data Quality?
Desirable characteristics for an information resource, described as a series of quality dimensions:
- Discoverability & Accessibility: storing and classifying in an appropriate and consistent manner
- Accuracy: correctly represents the real-world values it models
- Consistency: created and maintained using standardized definitions, calculations, terms, and identifiers
- Provenance & Reputation: track source & determine reputation
  - Includes the objectivity of the source/producer
  - Is the information unbiased, unprejudiced, and impartial? Or does it come from a reputable but partisan source?
Wang, R. and D. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996. 12(4): p. 5-33.

Data Quality
[Figure: product records from two sources, Source A and Source B]
Source A:
  ID    PNAME      PCOLOR  PRICE
  APNR  iPod Nano  Red     150
  APNS  iPod Nano  Silver  160
Source B (column headers not recoverable): iPod Nano, IPN890, 150, 5
Second table (as extracted):
  APNR  iPod Nano  Red     150
  APNR  iPod Nano  Silver  160
Questions raised: Schema Difference? Value Conflicts? Entity Duplication?
Roles shown: Data Developer (technical), Data Steward, Business Users (domain)

What is Data Curation?
- Digital Curation
  - Selection, preservation, maintenance, collection, and archiving of digital assets
- Data Curation
  - Active management of data over its life-cycle
- Data Curators
  - Ensure data is trustworthy, discoverable, accessible, reusable, and fit for use
  - Museum cataloguers of the Internet age

Related Activities
- Data Governance / Master Data Management
  - Convergence of data quality, data management, business process management, and risk management
  - Part of the overall data governance strategy for an organization
- Data Curator = Data Steward

Types of Data Curation
- Multiple approaches to curate data, no single correct way
- Who?
  - Individual Curators
  - Curation Departments
  - Community-based Curation
- How?
  - Manual Curation
  - (Semi-)Automated
  - Sheer Curation

Types of Data Curation - Who?
- Individual Data Curators
  - Suitable for infrequently changing small quantity of data (million records)
  - Availability: post-hoc nature creates delay in curated data availability

Types of Data Curation - Who?
- Community-Based Data Curation
  - Decentralized approach to data curation
  - Crowd-sourcing the curation process
  - Leverages community of users to curate data
  - Wisdom of the community (crowd)
  - Can scale to millions of records

Types of Data Curation - How?
- Manual Curation
  - Curators directly manipulate data
  - Can tie users up with low-value-add activities
- (Semi-)Automated Curation
  - Algorithms can (semi-)automate curation activities such as data cleansing, record deduplication and classification
  - Can be supervised or approved by human curators

Types of Data Curation - How?
- Sheer curation, or Curation at Source
  - Curation activities integrated in normal workflow of those creating and managing da