"Stories" in data and the roles of crowdsourcing – views of a Web miner

Bettina Berendt

Dept. of Computer Science
KU Leuven, Belgium
http://people.cs.kuleuven.be/~bettina.berendt/

Thanks to: Ilija Subašić, Markus Luczak-Rösch, and Laura Drăgan

A story

Story structure

One case of provenance

Another case of provenance

Formalizing provenance: a high-level view

Challenge 1: Many voices

Challenge 2

Challenge 3: Subjectivity

The STORIES Tool

Uncover (1)

Uncover (2)

Scan (over time)

Uncover

Search: formulating ad-hoc concepts

Track (2)

Textual summarization

Challenge 4

Crowd-sourcing the truth? Wikipedia (here: the Gaza Flotilla Raid)

Challenge 5

Challenge 4: More specifically

Challenge 5: Vagueness – reprise

The “live crowdsourcing activity”

• Goal: crowdsource data citation metadata
• Motivation 1 / possible extension
• Motivation 2 / case study

http://prov.usewod.org

The data

Datasets

Publications

[People]

The datasets

Preloaded:

– USEWOD datasets
– DBpedia
– SWDF
– Bio2RDF
– LinkedGeoData
– BioPortal
– OpenBioMed

The datasets

Preloaded:

– Generic (!)
– Versions/releases
– References

The datasets

Add new:

– Name*
– Version
– Release date
– URL

The publications

Preloaded:

– USEWOD workshop papers

The publications

Add new:

– Title*
– Authors
– Year
– URL

The data

The task

Capture which dataset is used in which publication, and how

Data representation

Datasets

Publications

Connections between them

schema.org

prov:Entity

?

Data representation

Datasets

Publications

Connections between them

schema.org

prov:Entity

prov:Derivation
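As a rough illustration of the representation above, the following sketch types one dataset and one publication as prov:Entity and links them with a derivation. The PROV-O terms (prov:Entity, prov:wasDerivedFrom) and the schema.org classes (schema:Dataset, schema:ScholarlyArticle) are real vocabulary terms; the example.org identifiers, the helper names, and the choice of these particular schema.org classes are assumptions made for illustration and may differ from what prov.usewod.org actually stores.

```python
# Minimal sketch: datasets and publications as prov:Entity, linked by a derivation.
# All example.org IRIs and helper names are hypothetical.

PREFIXES = {
    "prov": "http://www.w3.org/ns/prov#",
    "schema": "http://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/",  # hypothetical namespace
}


def expand(curie):
    """Turn a prefixed name such as 'prov:Entity' into a full IRI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def describe(resource, schema_type):
    """Type a resource both as a prov:Entity and as a schema.org class."""
    return [
        (resource, "rdf:type", "prov:Entity"),
        (resource, "rdf:type", schema_type),
    ]


def to_ntriples(triples):
    """Serialise (subject, predicate, object) triples as N-Triples lines."""
    return "\n".join(
        "<{}> <{}> <{}> .".format(expand(s), expand(p), expand(o))
        for s, p, o in triples
    )


triples = (
    describe("ex:dataset1", "schema:Dataset")
    + describe("ex:paper1", "schema:ScholarlyArticle")
    # Assumption: the publication is treated as derived from the dataset it uses.
    + [("ex:paper1", "prov:wasDerivedFrom", "ex:dataset1")]
)

print(to_ntriples(triples))
```

Plain tuples keep the sketch free of dependencies; an RDF library would be the more natural choice in a real tool.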

Connections

Publication – Publication

Publication – Dataset

Dataset – Publication

Dataset – Dataset

Connections

Publication – Publication

citation

Connections

Publication – Dataset

Dataset – Publication

mentions

describes

evaluates

analyses

compares

Connections

Dataset – Dataset

extends

includes

overlaps

transformation of

generalisation of

Data representation

Subclasses of prov:Derivation

(inverse of Publication-DS)
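The connection types listed on the preceding slides can be sketched as a small lookup table that checks whether a relation is allowed between two kinds of item. The relation labels come from the slides; the names `CONNECTION_TYPES`, `Connection` and `connect`, and the decision to store a Dataset – Publication connection as the inverse Publication – Dataset one, are assumptions for illustration only.

```python
from typing import NamedTuple

# Relation labels per (source kind, target kind), as listed on the slides.
CONNECTION_TYPES = {
    ("publication", "publication"): {"citation"},
    ("publication", "dataset"): {
        "mentions", "describes", "evaluates", "analyses", "compares",
    },
    ("dataset", "dataset"): {
        "extends", "includes", "overlaps",
        "transformation of", "generalisation of",
    },
}


class Connection(NamedTuple):
    source: str    # name of the source item
    target: str    # name of the target item
    relation: str  # one of the relation labels above


def connect(source_kind, source, target_kind, target, relation):
    """Validate a connection and return it in its canonical direction."""
    if (source_kind, target_kind) == ("dataset", "publication"):
        # Assumption: stored as the inverse Publication -> Dataset connection.
        source_kind, target_kind = target_kind, source_kind
        source, target = target, source
    allowed = CONNECTION_TYPES.get((source_kind, target_kind), set())
    if relation not in allowed:
        raise ValueError(
            "{!r} is not defined between {} and {}".format(
                relation, source_kind, target_kind
            )
        )
    return Connection(source, target, relation)


# "Paper A" is a hypothetical publication; DBpedia is one of the preloaded datasets.
print(connect("dataset", "DBpedia", "publication", "Paper A", "analyses"))
```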

Data representation

Bundles

Live crowdsourcing activity 2014: outcomes

Participants: 6

Bundles: 81 (avg: 13.5, min: 2, max: 27)

Publications: 19

Datasets: 2 (3)

Connections: 95 (Inclusion: 62, Analysis: 21, Mention: 6)
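The reported figures can be cross-checked with simple arithmetic: 81 bundles over 6 participants gives the stated average of 13.5, and the three connection types that are listed sum to 89, so 6 of the 95 connections fall under types the slide does not name. A quick check, using only the numbers above (variable names are illustrative):

```python
participants = 6
bundles = 81
connections = 95
reported = {"Inclusion": 62, "Analysis": 21, "Mention": 6}

avg_bundles = bundles / participants   # bundles per participant
listed = sum(reported.values())        # connections with a named type
unlisted = connections - listed        # connections of other, unnamed types

# Share of each named type among all connections, in percent.
share = {name: round(100 * n / connections, 1) for name, n in reported.items()}

print(avg_bundles, listed, unlisted)
print(share)
```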

Lessons learned

Data is dirty

– even coming from experts

Focus on the task

– make everything else simpler
– minimise data input

Questionnaire results

Inconclusive results on the suitability of the vocabulary, but interesting answers to “What questions would this information answer for you?”:

● “What are popular datasets?”
● “Which datasets are facilitators for research on X?”
● “What publications are related through a dataset (but don't mention each other)?”
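Two of these questions can be sketched as simple queries over the captured connections. The toy data below is hypothetical (the paper names and their connections are invented; only the dataset names appear in the preloaded list), and the function names are illustrative. "Popular" is read here as "connected to the most distinct publications", which is one possible interpretation.

```python
from itertools import combinations

# Hypothetical toy data: (publication, relation, dataset) and (citing, cited).
uses = [
    ("Paper A", "analyses", "DBpedia"),
    ("Paper B", "evaluates", "DBpedia"),
    ("Paper C", "mentions", "DBpedia"),
    ("Paper C", "analyses", "LinkedGeoData"),
]
citations = {("Paper B", "Paper A")}


def publications_per_dataset(uses):
    """Map each dataset to the set of publications connected to it."""
    index = {}
    for pub, _relation, dataset in uses:
        index.setdefault(dataset, set()).add(pub)
    return index


def popular_datasets(uses):
    """Rank datasets by the number of distinct connected publications."""
    index = publications_per_dataset(uses)
    return sorted(
        ((ds, len(pubs)) for ds, pubs in index.items()),
        key=lambda item: (-item[1], item[0]),
    )


def related_without_citation(uses, citations):
    """Publication pairs that share a dataset but do not cite each other."""
    pairs = set()
    for dataset, pubs in publications_per_dataset(uses).items():
        for a, b in combinations(sorted(pubs), 2):
            if (a, b) not in citations and (b, a) not in citations:
                pairs.add((a, b, dataset))
    return sorted(pairs)


print(popular_datasets(uses))
print(related_without_citation(uses, citations))
```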

Outlook (1): Dimensions of crowdsourcing

Dimensions [Quinn & Bederson, 2012]

• What is outsourced
• Who is the crowd
• How is the task designed
• How are the results validated
• How can the process be optimised

Specific questions

• [Who] Which crowd(s)? Experts & non-experts
• [What] Enhanced by IE?
• [Design/validation] How to combine these sources of metadata?
• [Optimisation] Incentives?

▫ “Student science”?
▫ Citizen science?
▫ “Learner science”?

Enlarging the scope: “How come ...?” Storytelling

THANK YOU!

Some references:

• Subašić, I. & Berendt, B. (2009). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems. http://people.cs.kuleuven.be/~bettina.berendt/Papers/subasic_berendt_2009.pdf

• Berendt, B., Last, M., Subašić, I. & Verbeke, M. (2013). New formats and interfaces for multi-document news summarization and its evaluation. In: Fiori, A. (ed.), Innovative Document Summarization Techniques: Revolutionizing Knowledge Understanding. IGI Global. https://lirias.kuleuven.be/bitstream/123456789/423917/1/berendt_last_subasic_verbeke_2013_withbib.pdf

• Drăgan, L., Luczak-Rösch, M., Simperl, E., Berendt, B. & Moreau, L. (2014). Crowdsourcing data citation graphs using provenance. In: Provenance Analytics (ProvAnalytics2014), Cologne, DE, 09 Jun 2014. 4 pp. http://eprints.soton.ac.uk/365374/

• ~ Presentation at LCPD 2014: Second Workshop on Interlinking and Contextualizing Publications and Datasets, to appear in D-Lib Magazine
