Upload: dgarijo

Post on 11-Apr-2017

TRANSCRIPT

Page 1: Towards Automating Data Narratives

TOWARDS AUTOMATING DATA NARRATIVES

Yolanda Gil, Daniel Garijo

Information Sciences Institute, University of Southern California

@yolandagil, @dgarijov

{gil,dgarijo}@isi.edu

Page 2: Towards Automating Data Narratives

The Scientific Research Process

Towards Automating Data Narratives. Yolanda Gil and Daniel Garijo

Formulate hypothesis

Define the experiment (data + method)

Find data

Run experiments (methods)

Meta-analysis of results

Revise hypothesis

Page 3: Towards Automating Data Narratives

The products of scientific research

Formulate hypothesis

Define the experiment (data + method)

Find data

Run experiments (methods)

Meta-analysis of results

Revise hypothesis

Publication

Methods

Data

Software

Execution traces

Page 4: Towards Automating Data Narratives

Reconstructing the Computations from the Text in the Paper

Comparison of Ligand Binding Sites

The SMAP software was used to compare the binding sites of the 749 M.tb protein structures plus 1,446 homology models (a total of 2,195 protein structures) with the 962 binding sites of 274 approved drugs, in an all-against-all manner. While the binding sites of the approved drugs were already defined by the bound ligand, the entire protein surface of each of the 2,195 M.tb protein structures was scanned in order to identify alternative binding sites. For each pairwise comparison, a P-value representing the significance of the binding site similarity was calculated.

“The Mycobacterium Tuberculosis Drugome and Its Polypharmacological Implications.” Kinnings, S. L.; Xie, L.; Fung, K. H.; Jackson, R. M.; Xie, L.; and Bourne, P. E. PLoS Computational Biology, 2011.

Page 5: Towards Automating Data Narratives

Problem with current approaches: what the paper said vs what the software did

“The Mycobacterium Tuberculosis Drugome and Its Polypharmacological Implications.” Kinnings, S. L.; Xie, L.; Fung, K. H.; Jackson, R. M.; Xie, L.; and Bourne, P. E. PLoS Computational Biology, 2011.

Actual computation

Page 6: Towards Automating Data Narratives

Problem with current approaches

• Incomplete: missing steps and intermediate data

• Ambiguous: several interpretations about how computations are done

• Inconsistent level of detail: mixing of general methods with execution details

[Figure: three workflow sketches. Incomplete: Step 1, Step ??, Step 2. Ambiguous: Step 1 and Step 2 versus Step 1’ and Step 2’ (Implementation 1? Implementation 2?). Inconsistent level of detail: Step 1 and Step 2 shown next to Param1 = 2 and File = “Input.txt”.]

Page 7: Towards Automating Data Narratives

Formulate hypothesis

Define the experiment (data + method)

Find data

Run experiments (methods)

Meta-analysis of results

Revise hypothesis

Publication

Methods

Data

Execution traces

Report generation

Our approach: From research outputs to text

Page 8: Towards Automating Data Narratives

Formulate hypothesis

Define the experiment (data + method)

Find data

Run experiments (methods)

Meta-analysis of results

Revise hypothesis

Publication

Methods

Data

Execution traces

Report generation

Our approach: From research outputs to text

Reports must:

• Be true to actual events
• Enable inspection
• Be human-understandable
• Abstract details

Page 9: Towards Automating Data Narratives

Data Narratives

• Interlinked record of
  • High level workflows (methods)
  • Provenance of results (method executions)
  • Data
  • Software metadata
• Persistent identifiers
• Data narrative accounts
  • Alternative descriptions of a result with a different level of detail.

Truth to actual records

Inspectability

Human readable, levels of abstraction
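The interlinked record described on this slide can be pictured as a small data structure that stores identifiers rather than copies of the resources. The following is a minimal sketch under that assumption; the class name `DataNarrative`, its field names, and the `example:` workflow identifier are hypothetical and are not taken from DANA.

```python
from dataclasses import dataclass, field


@dataclass
class DataNarrative:
    """Interlinked record behind one result (hypothetical structure)."""
    result_id: str      # persistent identifier of the result being described
    workflow_id: str    # high level workflow (method)
    execution_id: str   # provenance of the result (method execution)
    data_ids: list = field(default_factory=list)      # datasets involved
    software_ids: list = field(default_factory=list)  # software metadata entries
    accounts: dict = field(default_factory=dict)      # view name -> narrative text

    def add_account(self, view, text):
        """Attach an alternative description of the result."""
        self.accounts[view] = text

    def account(self, view):
        """Return the account for a view, or a fallback message."""
        return self.accounts.get(view, "No account for this view.")


narrative = DataNarrative(
    result_id="10.6084/m9.figshare.776856",
    workflow_id="example:workflow/topic-modeling",  # hypothetical identifier
    execution_id="http://www.opmw.org/export/page/resource/"
                 "WorkflowExecutionAccount/ACCOUNT1348628778528",
    data_ids=["10.6084/m9.figshare.776887", "10.6084/m9.figshare.776888"],
)
narrative.add_account("method", "Topic training was run on the input dataset.")
print(narrative.account("method"))
```

Keeping only identifiers in the record is one way to stay true to the actual records: each account can be checked against the resource it points to.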

Page 10: Towards Automating Data Narratives

Data Narrative Accounts: An example

How was the dataset used in this visualization generated?

Page 11: Towards Automating Data Narratives

Data Narrative Accounts: An example

• Execution view
  • Inputs, parameters and main outputs

“Topic modeling was run on the Reuters R8 dataset (10.6084/m9.figshare.776887), and English Words dataset (10.6084/m9.figshare.776888), with iterations set to 100, stop word size set to 3, number of topics set to 10 and batch size set to 10. The results are at 10.6084/m9.figshare.776856”

• Data view
  • Just the data that influenced the results

“The topics at 10.6084/m9.figshare.776856 were found in the Reuters R8 dataset (10.6084/m9.figshare.776887) and English Words dataset (10.6084/m9.figshare.776888)”

• Method view
  • Main steps based on their functionality

“Topic training was run on the input dataset. The results are product of PlotTopics, a visualization step”
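An execution view like the one quoted above can be approximated by filling a sentence template from an execution record. This is a minimal sketch, assuming the record is available as a plain dictionary; the function names, the dictionary keys, and the template wording are illustrative and are not DANA's actual templates.

```python
def join_items(items):
    """Join phrases as 'a', 'a and b', or 'a, b and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def execution_view(record):
    """Render inputs, parameters and the main output as one short account."""
    inputs = join_items(
        f"{name} dataset ({doi})" for name, doi in record["inputs"]
    )
    params = join_items(
        f"{name} set to {value}" for name, value in record["parameters"]
    )
    return (
        f"{record['method']} was run on the {inputs}, with {params}. "
        f"The results are at {record['output']}"
    )


# Values copied from the example on this slide.
record = {
    "method": "Topic modeling",
    "inputs": [
        ("Reuters R8", "10.6084/m9.figshare.776887"),
        ("English Words", "10.6084/m9.figshare.776888"),
    ],
    "parameters": [
        ("iterations", 100),
        ("stop word size", 3),
        ("number of topics", 10),
        ("batch size", 10),
    ],
    "output": "10.6084/m9.figshare.776856",
}
print(execution_view(record))
```

A data view could reuse the same record and simply leave out the parameters, which is one way to obtain accounts at different levels of detail from a single source.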

Page 12: Towards Automating Data Narratives

Data Narrative Accounts: An example

• Dependency view
  • How the steps depend on each other

“First, the input data is filtered by Stop Words, followed by Small Words, Format Dataset, and Train Topics. The final results are produced by Plot Topics”

• Implementation view
  • How the steps were implemented in the execution

“Train topics was implemented using Latent Dirichlet allocation”

• Software view
  • Details on the software used to implement the steps

“The train topics step was generated with Online LDA open source software, written in Java. Plot topics was generated with the Termite software.”
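A dependency view needs the steps in an order that respects their dependencies. The sketch below assumes the workflow is given as a mapping from each step to the steps it depends on and uses Python's standard `graphlib` module (Python 3.9 or later) to order them. The function names and the generic wording "processed by" are assumptions; the quoted account above says "filtered by", which is specific to that workflow.

```python
from graphlib import TopologicalSorter


def oxford_join(items):
    """Join names as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def dependency_view(depends_on):
    """Describe the steps in dependency order, first to last."""
    order = list(TopologicalSorter(depends_on).static_order())
    if len(order) < 2:
        raise ValueError("need at least two steps to describe dependencies")
    first, *middle, last = order
    text = f"First, the input data is processed by {first}"
    if middle:
        text += ", followed by " + oxford_join(middle)
    return f"{text}. The final results are produced by {last}"


# Each step maps to the set of steps it depends on (names from the slide).
depends_on = {
    "Small Words": {"Stop Words"},
    "Format Dataset": {"Small Words"},
    "Train Topics": {"Format Dataset"},
    "Plot Topics": {"Train Topics"},
}
print(dependency_view(depends_on))
```

For a workflow with parallel branches, several orders are valid, so a real generator would probably need to describe the branches explicitly rather than flatten them into one sequence.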

Page 13: Towards Automating Data Narratives

DANA: DAta NArratives

[Figure: DANA architecture. Components: Experiment Records, Provenance Repository, Software registry, Experiment-specific Knowledge Base, Query patterns, Data Narrative aggregator, DANA Generator, Narrative accounts. Labeled interactions: Input, Resource request, Response, Get query, Pattern result, Get pattern, Output.]

1. Identify which experiment records to describe
2. Generation of an experiment-specific knowledge base
3. Creation of the Data Narrative from templates
4. Produce narrative accounts

Page 14: Towards Automating Data Narratives

Generation of an experiment-specific knowledge base: scientific workflows

WINGS workflow system

• High level workflow templates that can be elaborated through component ontologies

http://www.wings-workflows.org/

Page 15: Towards Automating Data Narratives

Generation of an experiment-specific knowledge base: provenance records as RDF

See a hyperlinked description/visualization at its persistent URL: https://goo.gl/v8EPg5

http://www.opmw.org/export/page/resource/WorkflowExecutionAccount/ACCOUNT1348628778528

10.6084/m9.figshare.776887
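Once provenance is available as RDF, questions such as "which data influenced this result?" become graph traversals. The sketch below is only illustrative: it stores triples as plain Python tuples instead of using an RDF library, it uses the W3C PROV-O property names `prov:used` and `prov:wasGeneratedBy`, and every `ex:` identifier is hypothetical. It does not reproduce the actual records behind the URL above.

```python
USED = "prov:used"
GENERATED_BY = "prov:wasGeneratedBy"


def source_data(triples, result):
    """Return the artifacts the result depends on that no recorded step generated."""
    # Assumes each artifact was generated by at most one step.
    generated_by = {s: o for s, p, o in triples if p == GENERATED_BY}
    used = {}
    for s, p, o in triples:
        if p == USED:
            used.setdefault(s, []).append(o)

    sources, seen, stack = [], set(), [result]
    while stack:
        artifact = stack.pop()
        if artifact in seen:
            continue
        seen.add(artifact)
        step = generated_by.get(artifact)
        if step is None:
            # Nothing generated this artifact, so it is an original input.
            if artifact != result:
                sources.append(artifact)
            continue
        stack.extend(used.get(step, []))
    return sorted(sources)


# Hypothetical provenance of a two-step execution.
triples = [
    ("ex:topics", GENERATED_BY, "ex:TrainTopics"),
    ("ex:TrainTopics", USED, "ex:formattedDataset"),
    ("ex:formattedDataset", GENERATED_BY, "ex:FormatDataset"),
    ("ex:FormatDataset", USED, "ex:reutersR8"),
    ("ex:FormatDataset", USED, "ex:englishWords"),
]
print(source_data(triples, "ex:topics"))
```

The intermediate dataset is skipped because a step generated it, which mirrors the data view: only the data that influenced the result is reported.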

Page 16: Towards Automating Data Narratives

Generation of an experiment-specific knowledge base: Software metadata

• Catalog of motifs [Garijo et al 2013] (http://purl.org/net/wf-motifs)
  • A catalog of common domain independent workflow patterns based on the functionality of workflow steps
• OntoSoft distributed software registry [Gil et al 2016] (http://www.ontosoft.org/portal)
  • Descriptions of hundreds of software components
  • Key metadata of software:
    • License
    • Usage
    • Authors
    • Web page
    • Code repository
    • Etc.

[Garijo et al 2013]: Common Motifs in Scientific Workflows: An Empirical Analysis. Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Future Generation Computer Systems, 2013.

[Gil et al 2016]: OntoSoft: A Distributed Semantic Registry for Scientific Software. Gil, Y.; Garijo, D.; Mishra, S.; and Ratnakar, V. In Proceedings of the Twelfth IEEE Conference on eScience, Baltimore, MD, 2016.
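Software metadata of this kind is what a software view can draw on. The following is a minimal sketch, assuming each registry entry is reduced to a small record. The class `SoftwareEntry`, its fields, and the sentence template are hypothetical and do not reflect the OntoSoft data model or API; the `language` field is an added assumption, and metadata not stated in the slides is left empty rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareEntry:
    """Subset of the key metadata listed above (hypothetical record)."""
    name: str
    language: str = ""         # assumed extra field, not in the slide's list
    license: str = ""
    authors: tuple = ()
    web_page: str = ""
    code_repository: str = ""


def software_view(implemented_by):
    """Describe which software implemented each workflow step."""
    sentences = []
    for step, software in implemented_by.items():
        sentence = (
            f"The {step} step was generated with the {software.name} software"
        )
        if software.language:
            sentence += f", written in {software.language}"
        sentences.append(sentence + ".")
    return " ".join(sentences)


# Step and software names taken from the software view example in the slides.
implemented_by = {
    "Train Topics": SoftwareEntry(name="Online LDA", language="Java"),
    "Plot Topics": SoftwareEntry(name="Termite"),
}
print(software_view(implemented_by))
```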

Page 17: Towards Automating Data Narratives

Generating narrative accounts

RDF

Account template

Page 18: Towards Automating Data Narratives

Formative evaluation

• Survey with 6 target scenarios
• Each scenario:
  • Description of a situation where a user has to do a task

Page 19: Towards Automating Data Narratives

Formative evaluation

• Survey with 6 target scenarios
• Each scenario:
  • Description of a situation where a user has to do a task
  • A workflow sketch of the analysis done
  • Six candidate narratives of that workflow sketch

Page 20: Towards Automating Data Narratives

Formative evaluation

• Survey with 6 target scenarios
• Each scenario:
  • Description of a situation where a user has to do a task
  • A workflow sketch of the analysis done
  • Six candidate narratives of that workflow sketch
• 12 responses from users
• Results
  • Each narrative is considered appropriate for describing some scenario
  • Different users chose different narratives for each scenario

Page 21: Towards Automating Data Narratives

Summary: Benefits of Data Narratives

Features                 Data Narratives  Provenance Records  Visualizations  Articles  Electronic Notebooks
Truth to actual records  Y                Y                   Just data       Maybe     Maybe
Enable inspection        Y                Y                   Just data       N         Y
Human understandable     Y                N                   Y               Y         Y
Abstract details         Y                N                   Y               Y         N
Part of papers           Y                N                   Y               Y         Maybe
Persistent               Y                Maybe               N               Y         Maybe
Different audiences      Y                N                   N               N         N
Automatically generated  Y                Y                   Maybe           N         N

Page 22: Towards Automating Data Narratives

Conclusions and future work

• Data Narratives
  • Interlink data, software, workflows and provenance of a scientific experiment
  • Persistent identifiers
  • Narrative accounts
• Future work:
  • Ease navigation through levels of detail
  • Mixing details of different narratives
  • Improve summarization of results
  • Additional evaluation of narrative usefulness

See more: http://dgarijo.github.io/DataNarratives/

Page 23: Towards Automating Data Narratives

TOWARDS AUTOMATING DATA NARRATIVES

Yolanda Gil, Daniel Garijo

Information Sciences Institute and Department of Computer Science

@yolandagil, @dgarijov

{gil,dgarijo}@isi.edu