Text and Data Mining explained at FTDM
![Page 1: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/1.jpg)
Content Mining of Science and Medicine
Peter Murray-Rust, ContentMine.org and University of Cambridge
FTDM Knowledge Cafe, Leiden, NL, 2016-02-29
F/OSS tools from contentmine.org
Images from Wikimedia CC-BY-SA
![Page 2: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/2.jpg)
Disclaimer
The opinions, software and objects in this presentation are those of PMR+ContentMine (CM), in its non-FutureTDM role. No FTDM resources were used in creating slides, software, artefacts.
PMR has tried to give an objective listing of most of the main components of TDM, but has used CM technology to illustrate this.
![Page 3: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/3.jpg)
The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
![Page 4: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/4.jpg)
Mining strategy
• Discover, negotiate permissions => bibliography
• Crawl / Scrape (download) documents AND supplemental
• Normalize: PDF => XML
• Index: facets => Facts and snippets (“entities”)
• Interpret/analyze entities => relationships, aggregations (“Transformative”)
• Publish
![Page 5: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/5.jpg)
CONTENTMINE: Complete OPEN Platform for Mining Scientific Literature
[Diagram: ContentMine pipeline. Labels recovered from the figure:]
• Sources: EPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
• Tools: catalogue; getpapers (query); Daily Crawl; quickscrape (crawl, scrapers); norma (Normalizer, Structurer, Semantic Tagger); ami (search, lookup, plugins)
• Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• Normalized content: Semantic ScholarlyHTML (text, data, figures; abstract, methods, references, captioned figures, HTML tables)
• Content mining plugins: Chem, Phylo, Trials, Crystal, Plants; community queries and taggers
• Throughput: 30,000 pages/day
• Outputs: Facts; Visualization and Analysis
![Page 6: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/6.jpg)
Semantic Fulltext
• EuropePMC: coherent Open Access
• getpapers: query, download (through API).
• AMI filters, checks[1], transforms facts in papers.
• sequences, species, genera, genes, dictionaries
[0] All operations shown run in a total of <3 minutes.
[1] Dictionaries and lookup.
[2] Usable from home by anyone
Zika endemic areas, Wikimedia CC-BY-SA
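A minimal sketch of the "query, download (through API)" step, assuming the public Europe PMC REST search endpoint and its `OPEN_ACCESS:y` query filter. This is not the getpapers implementation; the function names `build_search_url` and `extract_pmcids` are hypothetical, and the code only builds the request URL and reads identifiers out of an already-fetched JSON response.

```python
from typing import Dict, List
from urllib.parse import urlencode

# Assumed endpoint: the public Europe PMC REST search service.
EUPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term: str, open_access_only: bool = True, page_size: int = 100) -> str:
    """Build a search URL for a free-text term, optionally restricted to Open Access."""
    query = f"({term}) AND OPEN_ACCESS:y" if open_access_only else term
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUPMC_SEARCH}?{urlencode(params)}"


def extract_pmcids(response: Dict) -> List[str]:
    """Collect PMC identifiers from a parsed JSON search response."""
    results = response.get("resultList", {}).get("result", [])
    return [record["pmcid"] for record in results if "pmcid" in record]


if __name__ == "__main__":
    print(build_search_url("zika"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and paging through results are left out so the sketch stays runnable offline.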
![Page 7: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/7.jpg)
Download all Open Access “Zika” from EuropePMC in 10 seconds (click below for movie)
Aedes aegypti, Wikimedia CC-BY-SA
Note: movies of this and other slides can be seen at https://vimeo.com/154705161
![Page 8: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/8.jpg)
Downloaded all Open Access “Zika” from EuropePMC in 10 seconds
Final download screen
![Page 9: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/9.jpg)
Eyeballing 20/120 Zika papers, click below for movie
Yellow Fever Virus Wikimedia CC-BY-SA
![Page 10: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/10.jpg)
Commonest words in 120 Zika papers
3011 virus
1939 Ae./Aedes
1212 dengue
901 mosquito/es
894 species
791 ZIKV
721 using
716 DENV
567 detection
513 aegypti
484 infection
442 RNA
428 protein
401 albopictus
360 viral
Mosquito spp. Wikimedia CC-BY-SA
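A word-frequency list like the one on this slide can be approximated in a few lines. This is a sketch only, not the ContentMine code: the stopword list and the two sample sentences are made up for illustration, and unlike the slide it does not merge variants such as "Ae." and "Aedes".

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller one).
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "by", "for", "with"}


def commonest_words(texts, top_n=15):
    """Count alphabetic tokens across all texts, ignoring case and stopwords."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[A-Za-z]+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = [
        "Zika virus is carried by Aedes mosquitoes.",
        "Dengue virus and Zika virus share the Aedes vector.",
    ]
    for word, count in commonest_words(sample, top_n=3):
        print(count, word)
```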
![Page 11: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/11.jpg)
Filtering local files for sequence and viruses
AMI (part of ContentMine software)
(click below for movie)
![Page 12: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/12.jpg)
DNA Primers in running text
…the sodium channel voltage dependent gene (Nav). Primers used to amplify this fragment were AaNaA 5’-ACAATGTGGATCGCTTCCC-3’ and AaNaB 5’-TGGACAAAAGCAAGGCTAAG-3’(8). The primers amplify a fragment of approximately 472…
Snippet (quotable under the 2014 UK Statutory Instrument, “Hargreaves”):
~/PMC4654492/results/sequence/dnaprimer/results.xml
W3C Annotation
[PREFIX] [MATCH] (link to target)[SUFFIX]
CMine structure
plugin
option
DNA double stranded fragment Wikimedia CC-BY-SA
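The [PREFIX] [MATCH] [SUFFIX] structure above can be illustrated with a regular expression over running text. This is a hedged sketch, not the AMI plugin itself: the pattern, the 40-character context window, and the function name `find_primers` are assumptions; the `pre`/`exact`/`post` keys follow the attribute names used in the results snippets in this deck.

```python
import re

# Matches 5'-SEQUENCE-3' with a straight, curly, or prime apostrophe.
PRIMER = re.compile(r"5[’'′]-([ACGT]+)-3[’'′]")


def find_primers(text, context=40):
    """Return each primer match with the text immediately before and after it."""
    hits = []
    for match in PRIMER.finditer(text):
        hits.append({
            "pre": text[max(0, match.start() - context):match.start()],
            "exact": match.group(0),
            "post": text[match.end():match.end() + context],
            "sequence": match.group(1),
        })
    return hits


if __name__ == "__main__":
    snippet = (
        "Primers used to amplify this fragment were AaNaA "
        "5’-ACAATGTGGATCGCTTCCC-3’ and AaNaB 5’-TGGACAAAAGCAAGGCTAAG-3’(8)."
    )
    for hit in find_primers(snippet):
        print(hit["sequence"])
```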
![Page 13: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/13.jpg)
Commonest species in 120 Zika papers
423 Ae./Aedes aegypti
333 Ae./Aedes albopictus
63 Ae. bromeliae
58 Ae. lilii
46 Ae. hensilli
42 Glossina pallidipes
40 Plasmodium vivax
35 Ae. luteocephalus
28 Ae. vittatus
25 Ae. furcifer
22 Plasmodium falciparum
21 Drosophila melanogaster
pre="fever (DHF), are caused by the world's most prevalent mosquito-borne virus. 37 DENV is carried by " exact="Aedes aegypti" post=" mosquito, which is strongly affected by ecological and human drivers, but also influenced by clima" name="binomial"/>
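Counts like the species list above can be derived by tallying the `exact` attribute across result snippets. The sketch below assumes each snippet is a `<result>` element inside a `<results>` root; only the `pre`, `exact`, `post` and `name` attributes appear in the slide, so the element names are an assumption, and the second and third sample records are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample data: first record shortened from the slide, the other two are hypothetical.
SAMPLE = """<results>
  <result pre="DENV is carried by " exact="Aedes aegypti" post=" mosquito, which is" name="binomial"/>
  <result pre="vectors such as " exact="Aedes albopictus" post=" are also" name="binomial"/>
  <result pre="populations of " exact="Aedes aegypti" post=" in urban areas" name="binomial"/>
</results>"""


def count_exact_matches(xml_text):
    """Tally the matched strings ('exact' attributes) in a results document."""
    root = ET.fromstring(xml_text)
    return Counter(
        element.get("exact")
        for element in root.iter("result")
        if element.get("exact")
    )


if __name__ == "__main__":
    for name, count in count_exact_matches(SAMPLE).most_common():
        print(count, name)
```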
![Page 14: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/14.jpg)
Commonest genera in Zika papers
183 Wolbachia
70 Aedes
69 Flavivirus/Flaviviridae
30 Glossina
17 Culex
pre="…-negative endosymbiotic bacterium, is a promising tool against diseases transmitted by mosquitoes. " exact="Wolbachia" post=" can be found worldwide in numerous arthropod species. More than 65% of all insect species are natu…"
Wolbachia in insect cell Wikimedia CC-BY-SA
![Page 15: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/15.jpg)
38 ITS20 MHC2TA19 COI20 CYPJ9221 CYP6BB222 CYP9J283 MHC
Commonest genes in 120 Zika papers
![Page 16: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/16.jpg)
• microcephaly: 400/2400 papers; 2 mins; commonest genes:
203 MCPH1
86 MECP2
54 SOX2
49 E2F1
47 SNAP29
40 IKBKG
40 NDE1
N-terminal domain of microcephalin Wikimedia CC-BY-SA
![Page 17: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/17.jpg)
Systematic Reviews
Researchers and their machines need to “read” hundreds of papers a day or even more.
![Page 18: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/18.jpg)
Polly has 20 seconds to read this paper…
…and 10,000 more
![Page 19: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/19.jpg)
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
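The effort quoted above can be checked with simple arithmetic. The figures come from the quote; the variable names are illustrative only.

```python
# Figures taken from the quote: 10,000 abstracts, 6 researchers, 2-3 days each.
abstracts = 10_000
researchers = 6
days_low, days_high = 2, 3

abstracts_per_researcher = abstracts / researchers  # roughly 1,667, quoted as ~1,600
person_days_low = researchers * days_low            # the "12 days" minimum in the quote
person_days_high = researchers * days_high

if __name__ == "__main__":
    print(round(abstracts_per_researcher), person_days_low, person_days_high)
```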
![Page 20: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/20.jpg)
400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in last 6 years??
Search the whole scientific literature
for “2009-0100068-41”
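Mapping trials to papers amounts to looking for registry identifiers in full text. The sketch below is an illustration under stated assumptions, not the ContentMine trials plugin: the corpus is a hypothetical in-memory dict of paper id to text, the identifier searched for is the one on the slide, and the only registry pattern included is the ClinicalTrials.gov form (`NCT` followed by eight digits).

```python
import re

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
NCT_ID = re.compile(r"\bNCT\d{8}\b")


def papers_mentioning(papers, trial_id):
    """Return the ids of papers whose text contains the exact trial identifier."""
    return [paper_id for paper_id, text in papers.items() if trial_id in text]


def nct_ids_by_paper(papers):
    """Map each paper id to the sorted set of NCT identifiers found in its text."""
    return {paper_id: sorted(set(NCT_ID.findall(text))) for paper_id, text in papers.items()}


if __name__ == "__main__":
    # Hypothetical corpus for illustration only.
    corpus = {
        "paper-1": "The trial was registered as 2009-0100068-41 before enrolment.",
        "paper-2": "Registered at ClinicalTrials.gov (NCT01234567).",
        "paper-3": "No registration is reported.",
    }
    print(papers_mentioning(corpus, "2009-0100068-41"))
    print(nct_ids_by_paper(corpus))
```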
![Page 21: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/21.jpg)
Extracting scientific information
![Page 22: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/22.jpg)
Mining strategy
• Discover, negotiate permissions => bibliography
• Crawl / Scrape (download) documents AND supplemental
• Normalize: PDF => XML
• Index: facets => Facts and snippets (“entities”)
• Interpret/analyze entities => relationships, aggregations (“Transformative”)
• Publish
![Page 23: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/23.jpg)
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
![Page 24: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/24.jpg)
CONTENTMINE: Complete OPEN Platform for Mining Scientific Literature
[Diagram: the ContentMine pipeline shown on the earlier slide (getpapers, quickscrape, norma, ami; sources such as EuPMC, arXiv, CORE, HAL; input formats; plugins; Facts)]
![Page 25: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/25.jpg)
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
![Page 26: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/26.jpg)
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have processed 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B EUR.
![Page 27: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/27.jpg)
Facts in context
daily IUCN endangered species news
en.wikipedia.org CC BY-SA
![Page 28: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/28.jpg)
ContentMine Fact of The Day
• Fact of the day
• Endangered species in recent science
• Facts
• Bubbles
![Page 29: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/29.jpg)
https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA
![Page 30: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/30.jpg)
“Root” 4500 papers each with 1 tree
![Page 31: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/31.jpg)
OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
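The bracketed string above is in Newick format, which is what makes the output computable. As a rough sketch (a regular expression, not a full Newick parser, and not the Norma code), leaf names and branch lengths can be pulled out as follows. The example string is the first complete subtree of the output shown, closed off with a semicolon for illustration.

```python
import re

# A leaf is a name made of letters/underscores followed by ":" and a branch length.
# Internal nodes look like "):86" and have no name, so they are not matched.
LEAF = re.compile(r"([A-Za-z_]+):(\d+(?:\.\d+)?)")


def newick_leaves(newick):
    """Return (leaf name, branch length) pairs in the order they appear."""
    return [(name, float(length)) for name, length in LEAF.findall(newick)]


def is_balanced(newick):
    """Cheap sanity check: equal numbers of opening and closing parentheses."""
    return newick.count("(") == newick.count(")")


if __name__ == "__main__":
    subtree = (
        "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
        "Synergistes_jonesii:301);"
    )
    print(is_balanced(subtree))
    for name, length in newick_leaves(subtree):
        print(name, length)
```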
![Page 32: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/32.jpg)
Supertree for 924 species
Tree
![Page 33: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/33.jpg)
Supertree created from 4300 papers
![Page 34: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/34.jpg)
ContentMine working with Libraries
• Cambridge: Library, Plant Sciences, Epidemiology, Chemistry
• Cochrane Collaboration on Systematic Reviews of Clinical Trials
• FutureTDM (H2020, LIBER)
• Running workshops and training
• Offers services for information extraction and indexing for born-digital documents.
![Page 35: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/35.jpg)
CM Future
• Hypothes.is: use ContentMine results for annotation
• (with Cambridge Univ Library) extracting daily scientific facts from open and closed literature.
• with EBI, Cochrane Collaboration, JISC, OKF, LIBER, TGAC/John Innes, DNADigest.
• Running workshops, hackdays.
• Planned outreach: MEPs, EC, Slashdot, Reddit, Kickstarter, geekdom
• http://contentmine.org (OpenLock non-profit)
![Page 36: Text and Data Mining explained at FTDM](https://reader035.vdocuments.site/reader035/viewer/2022062904/587fc08b1a28ab3b158b5017/html5/thumbnails/36.jpg)
The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org