content mining at wellcome trust

CONTENT-MINING IN SCIENCE. TheContentMine: Progress since “Hargreaves” legislation. Opportunities for UK and Europe. Peter Murray-Rust, 2015-04-14. Workshop sponsored by Wellcome Trust.

Upload: petermurrayrust

Post on 15-Jul-2015


TRANSCRIPT

Page 1: Content Mining at Wellcome Trust

CONTENT-MINING IN SCIENCE

TheContentMine

Progress since “Hargreaves” legislation

Opportunities for UK and Europe

Peter Murray-Rust, 2015-04-14

Workshop sponsored by Wellcome Trust

Page 2: Content Mining at Wellcome Trust

OUR TEAM

• Ross Mounce, @rmounce

• Richard Smith-Unna, @blahah404

• Stephanie Smith-Unna, @treblesteph

• Jenny Molloy, @jenny_molloy

• Mark MacGillivray, @cottagelabs

• Peter Murray-Rust, @petermurrayrust

• Charles Oppenheim, @CharlesOppenh

• Graham Steel, @McDawg

Page 3: Content Mining at Wellcome Trust

OUR MISSION

“make 100,000,000 facts from the STEM literature open, accessible and reusable”

Page 4: Content Mining at Wellcome Trust

WHY?

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn is the chief medical officer of Liberia’s Ministry of Health, where Vera Mussah is the director of county health services. Cameron Nutt is the Ebola response adviser to Partners in Health.

Page 5: Content Mining at Wellcome Trust

THE RIGHT TO READ IS THE RIGHT TO MINE

The recommendations of the Hargreaves report (UK) became law in 2014, allowing limitations and exceptions to copyright for non-commercial content mining for research.

The Hague decal

Page 6: Content Mining at Wellcome Trust

THE SCALE OF THE TASK

• ~ 27,000 peer reviewed journals*

• > 5,000 publishers

• ~ 3,000 new papers per day

• “costing” 15 Billion USD to publish

• Representing 500 Billion USD of research

*Ulrich’s database: http://ulrichsweb.serialssolutions.com/login

Page 7: Content Mining at Wellcome Trust

OUR WORKSHOPS

• Shuttleworth Foundation

• Leicester Univ

• Electronic Theses and Dissertations

• Austrian Science Fund, AT

• OKFest, DE

• Eur. Bioinformatics Institute (x2)

• Open Science, Rio de Janeiro, BR

• SciDataCon, Delhi, IN

• Univ of Chicago, US

• OpenCon 2014, Washington DC, US

• JISC, London

• LIBER

• Cochrane UK

• British Library

• Wellcome Trust

• WHO

OUR COLLABORATORS

• Shuttleworth Foundation

• Wikimedia/Wikidata

• Mozilla

• Open Knowledge

• LIBER

• British Library

• Wellcome Trust

• EBI (Eur. Bioinf. Inst.)

• JISC

• BBSRC

• Cochrane UK

• Open Access Button

• SPARC

• Creative Commons

• CORE

• EuropePubmedCentral

• Cambridge University Library

Page 8: Content Mining at Wellcome Trust

STRUCTURED INFORMATION

• chemical names and structures

• species

• metabolism

• phylogenetic trees

• …

Page 9: Content Mining at Wellcome Trust

INTERACTIVE DEMO of content mining

http://chemicaltagger.ch.cam.ac.uk/

Page 10: Content Mining at Wellcome Trust

ContentMine at Cochrane UK, 2015-03-16

Page 11: Content Mining at Wellcome Trust

CLINICAL TRIALS

• How do we find (mentions of) clinical trials?

• Is a document a (clinical) trial?

• What is the subject of the trial?

• What is the methodology used? How many/long?

• Does the design and practice conform to CONSORT?

• What are the outcomes?

• Can we extract specific re-usable information?

• Who is involved? (researchers, sponsors, patients?)

• Has a proposed trial been completed and reported?
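The first question, finding mentions of trials, can be approached with a registry-identifier search. The following is a minimal sketch, not ContentMine's own method: it assumes that trials are cited by their registry identifiers (ClinicalTrials.gov uses "NCT" plus 8 digits, the ISRCTN registry uses "ISRCTN" plus 8 digits). The function name `find_trial_ids` and the sample sentence are invented for illustration.

```python
import re

# Registry identifier patterns: "NCT" + 8 digits (ClinicalTrials.gov)
# and "ISRCTN" + 8 digits (ISRCTN registry).
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")


def find_trial_ids(text):
    """Return distinct trial identifiers in order of first appearance."""
    seen = []
    for match in TRIAL_ID.finditer(text):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Invented sample text, for illustration only.
sample = (
    "The study was registered as NCT01234567 (see also ISRCTN12345678). "
    "Results for NCT01234567 were reported in 2014."
)
print(find_trial_ids(sample))
```

A document that mentions an identifier is not necessarily a report of that trial, so a match only answers the first question; the later questions (methodology, outcomes, completion) need deeper analysis.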

Page 12: Content Mining at Wellcome Trust

COMMUNITY PROJECTS

• Clinical Trials (with Cochrane UK)

• Phyloinformatic Literature Unlocking Tools (PLUTo/BBSRC)

• EBI – MetaboLights

• Plant Sciences and farming (Cambridge, TGAC, OpenFarm)

• Crystallography Open Database (COD)

• OpenOil / OpenCorporates

Page 13: Content Mining at Wellcome Trust

METABOLIGHTS

• European Bioinformatics Institute

• database for metabolomics experiments and derived information

• cross-species, cross-technique, structures, biological roles, locations, concentrations

• http://www.ebi.ac.uk/metabolights/

Page 14: Content Mining at Wellcome Trust

CONTENTMINE WORKSHOPS AND HACKDAYS

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a day

Start simple: bagOfWords, Stemming, Regex, templates
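The "start simple" techniques can be tried in a few lines. The sketch below combines a regex tokenizer, a deliberately crude suffix-stripping stemmer, and a bag of words. It is an illustration only and is not taken from the ContentMine software: `crude_stem` is a toy stand-in for a real stemmer such as Porter's, and all names and the sample text are invented.

```python
import re
from collections import Counter

# Regex tokenizer: runs of lowercase letters.
TOKEN = re.compile(r"[a-z]+")


def crude_stem(word):
    """Strip a few common English suffixes (toy stemmer, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters of the word.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Count stemmed tokens, ignoring case, order and punctuation."""
    return Counter(crude_stem(tok) for tok in TOKEN.findall(text.lower()))


# Invented sample text, for illustration only.
bag = bag_of_words("Mining texts: mined text, mining tools.")
for stem, count in bag.most_common():
    print(stem, count)
```

The stemmer conflates "mining" and "mined" but would leave "mine" as a separate stem, which shows why a real workshop exercise would move on to a proper stemming library.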

Page 15: Content Mining at Wellcome Trust

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 16: Content Mining at Wellcome Trust

What is “Content”?

Emily Sena (neuroscience.ed.ac.uk) spends half a day digitising a diagram like this

ContentMine will soon be able to do it in 1 second

Page 17: Content Mining at Wellcome Trust

Note jaggy and broken pixels

NEW Bacteria must have a phylogenetic tree

Length_________Weight

Binomial Name Culture/Strain GENBANK ID

Evolution Rate

Page 18: Content Mining at Wellcome Trust

• CRAWL the web for scientific documents (articles, grey literature, repositories)

• quickSCRAPE pages (text, graphics, images, data)

• NORMA-lize page to semantic form

• MINE pages with your methods and tools (AMI)

• CAT-alogue results in searchable index

• Automate daily process (CANARY)

… Open semantic science …

contentmine.org Infrastructure
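The stages on this slide can be sketched as a chain of small functions. This is a hypothetical, in-memory illustration of the crawl, scrape, normalize, mine and catalogue flow, not the actual quickscrape, Norma, AMI or CAT software. Every function name, the two-document corpus and the dictionary-style species plugin are assumptions made for the example; no network access is involved.

```python
import re
from collections import defaultdict


def crawl(source):
    """CRAWL: list document identifiers (here, the keys of a dict)."""
    return list(source)


def scrape(source, doc_id):
    """SCRAPE: fetch the raw page for one document (here, a dict lookup)."""
    return source[doc_id]


def normalize(raw_html):
    """NORMA-lize: crudely strip tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return " ".join(text.split())


def mine(text, plugins):
    """MINE: run each named regex plugin, returning (plugin, match) facts."""
    return [
        (name, m.group(0))
        for name, pattern in plugins.items()
        for m in pattern.finditer(text)
    ]


def catalogue(facts_by_doc):
    """CAT-alogue: invert to an index from fact to the documents holding it."""
    index = defaultdict(list)
    for doc_id, facts in facts_by_doc.items():
        for fact in facts:
            if doc_id not in index[fact]:
                index[fact].append(doc_id)
    return dict(index)


def run_pipeline(source, plugins):
    """Run all stages once; a daily scheduler would call this repeatedly."""
    facts_by_doc = {}
    for doc_id in crawl(source):
        text = normalize(scrape(source, doc_id))
        facts_by_doc[doc_id] = mine(text, plugins)
    return catalogue(facts_by_doc)


# Invented two-document corpus and a tiny dictionary-style species plugin.
corpus = {
    "doc1": "<p>Samples of <i>Escherichia coli</i> were grown.</p>",
    "doc2": "<p>We compared <i>Escherichia coli</i> and <i>Bacillus subtilis</i>.</p>",
}
plugins = {"species": re.compile(r"\b(?:Escherichia coli|Bacillus subtilis)\b")}

index = run_pipeline(corpus, plugins)
for fact, doc_ids in index.items():
    print(fact, doc_ids)
```

Keeping each stage as a separate function mirrors the slide's separation of tools, so a per-journal scraper or a new mining plugin could be swapped in without touching the other stages.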

Page 19: Content Mining at Wellcome Trust

[Diagram: the CANARY pipeline, producing Open NORMA-lized Scientific Literature + Facts. Labels in the diagram:

• Sources: Scientific literature, Repositories, URL, DOI

• Acquisition: Crawl, Feed, quickscrape, per-journal Scrapers

• Input formats: PDF, XML, DOC, CSV, Bad HTML, OCR, Diagrams

• Norma: Index & Transform, XPath, per-journal Taggers, sHTML

• AMI plugins: Regex, Sequences, Species, Metadata, Chemistry, Phylogenetics, Farming, Bespoke

• Output: CAT-alogue index]

Page 20: Content Mining at Wellcome Trust

POSSIBLE USES

• Indexing/searching the literature; G***** for science

• Current awareness; alerts and practices

• Extraction and re-use of facts; re-computation

• Multidisciplinary integration; co-occurrence

• Compliance with funder/institution policies

• Managing your Research Data!

• Finding similar and complementary colleagues

• Reproducibility, checking data and avoiding fraud

Page 21: Content Mining at Wellcome Trust

How to leverage Content Mining for benefit of UK/EU

• Create UK showcase of successes in mining

• Graduate training by 3rd year UK graduate students.

• Develop EuropePMC as world resource for bio-mining

• Training/support for UK/EU libraries about Hargreaves.

• Central collection of born-digital UK theses

• Collect pre-copyright author manuscripts

• Integrate CM into Research Data Management tools

• Promote mining in all aspects of healthcare information

• Open collection of extracted scientific facts for the world