wikifactmine for plant chemistry
WikiFactMine for Phytochemistry
Tom Arrow1, Charles Matthews1, Jenny Molloy1,2, Ross Mounce1,2, Peter Murray-Rust1,3, Richard Smith-Unna1,2, Lars Willighagen1
1 The ContentMine, Cambridge, CB4 2HY; 2 Dept of Plant Sciences, University of Cambridge; 3 Dept of Chemistry, University of Cambridge
Mining the scientific literature for facts
All software (Apache2) and Data (CC0) are Open. http://github.com/ContentMine. ContentMine.org is a not-for-profit UK company. We thank The Shuttleworth Foundation for a Fellowship to PMR and The Wikimedia Foundation for funding for TA and CM. Contact peter@contentmine.org
ContentMine and Wikidata
Wikidata is "Wikipedia for machines" and supports ContentMine's Full Content search of the Bioscience literature. We go beyond keywords to automatically generated structured dictionaries with thousands of terms and aliases. Full Content means not just words, but structured documents, tables and diagrams. We (and you) can search the whole literature (via EuropePMC or Crossref) every day automatically, or retrospectively for your sub-areas of interest.
Example: Find facts about terpenes emitted by conifers in Indonesia. We autogenerate 3 large dictionaries for all terpenes, conifers and Indonesian place/island names in Wikidata.
Introduction
Understanding phytochemical diversity and metabolism can answer many important scientific questions and provide economically important information, forming the foundation for metabolic engineering of plant compounds. Phytochemical database resources exist, but much information on the association of phytochemicals with species, enzymes and places lacks the standardised format and metadata required to enable machine analysis. In some cases it is painstakingly extracted manually, but this approach is not scalable. Semi-automated extraction of phytochemical data across the full-text open access literature is anticipated to significantly extend previous abstract-only coverage. Here we present an open source pipeline and preliminary results for terpene data mining.
Reusable WikiFactMine Dictionaries.
We expand the Wikidata term terpene automatically to ~450 items (such as carvone), giving >1000 precise search terms and data. Similarly, in a few seconds we can generate dictionaries of conifers (1899) and Indonesian islands (6344), making broad queries precise.
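The expansion of items into search terms can be sketched as follows. This is a minimal illustration, not the WikiFactMine implementation: it assumes each item has already been fetched as a label plus a list of aliases, and the `Item` class, the `build_dictionary` function, the placeholder ids and the sample aliases are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One Wikidata-style item: an identifier, a label and optional aliases."""
    item_id: str
    label: str
    aliases: list = field(default_factory=list)


def build_dictionary(items):
    """Map every lower-cased label or alias to the id of the item it names."""
    terms = {}
    for item in items:
        for name in [item.label, *item.aliases]:
            key = name.strip().lower()
            if key:
                # Keep the first item seen for a term; real data may need
                # explicit handling of ambiguous names.
                terms.setdefault(key, item.item_id)
    return terms


# Hypothetical records: the ids are placeholders, not real Wikidata Q-numbers,
# and the aliases are illustrative only.
items = [
    Item("ITEM-1", "carvone", ["Carvol"]),
    Item("ITEM-2", "limonene", ["dipentene"]),
]
terpene_dictionary = build_dictionary(items)
print(sorted(terpene_dictionary))
```

Because each item contributes its label and all of its aliases, the number of search terms is larger than the number of items, which is consistent with ~450 items giving >1000 terms.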
Search Strategies.
(A) Daily search. All new Open publications (300-1000) on EuropePMC are downloaded to Wikimedia Labs, indexed by dictionaries, and the extracted facts (dictionary hits) are stored in Zenodo (CERN's Open repository). Each paper may have hundreds or thousands of facts.
(B) On-Demand. A researcher, especially one doing systematic reviews, creates a fairly general query in her field with a range of dates, journals, etc. and downloads papers (getpapers and quickscrape). The papers are then filtered locally with a much more precise query (norma/ami).
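Indexing by dictionaries and the precise local filter can be illustrated with a small sketch. It does not reproduce how norma/ami work internally: it assumes plain-text papers and simple whole-word matching, and the function names, the tiny dictionaries and the sample sentence are invented for illustration.

```python
import re


def find_hits(text, dictionary, name):
    """Return (dictionary name, term, character offset) for each whole-word hit."""
    hits = []
    for term in dictionary:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((name, term, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def matches_all(text, dictionaries):
    """True when the text has at least one hit from every dictionary."""
    return all(find_hits(text, terms, name) for name, terms in dictionaries.items())


# Hypothetical miniature dictionaries standing in for the generated ones.
dictionaries = {
    "terpene": {"carvone", "limonene"},
    "conifer": {"pinus merkusii"},
    "place": {"sumatra"},
}

# Invented sentence, used only to exercise the functions.
paper = "Limonene was detected in emissions from Pinus merkusii on Sumatra."

print(find_hits(paper, dictionaries["terpene"], "terpene"))
print(matches_all(paper, dictionaries))
```

Requiring a hit from every dictionary is one way to turn the broad example query (terpenes, conifers, Indonesia) into a precise filter over downloaded papers.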
[Figure: pipeline diagram. Labels: Researcher File Store; Publisher Sites; getpapers, quickscrape; Norma/ami; Tidying (PDF); Tagging; Dictionary Search; Science Search; Data Search; Diseases; Drugs; Phytochem; Species; Genes; Text; Figures; Data; Automatically Extracted Indexed Facts; All daily, 30,000 pages/day.]
A. All EPMC papers are downloaded every day and the facts are extracted into Zenodo and made publicly available.
B. The researcher searches repositories and also scrapes publisher sites for whatever chunk of the literature she wants. She runs local dictionaries and saves the results to disk, where they can be further analyzed. She can add any papers she has legal access to and re-run whenever required. E.g. Bag Of Words is a powerful tool for classifying papers.
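A bag-of-words classification can be sketched in a few lines. The poster does not say how the bags are compared, so this example assumes cosine similarity against one labelled example text per class; the function names and the example texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased alphabetic words, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, labelled_examples):
    """Return the label whose example text is most similar to the input."""
    bag = bag_of_words(text)
    return max(
        labelled_examples,
        key=lambda label: cosine_similarity(bag, bag_of_words(labelled_examples[label])),
    )


# Invented example texts, one per class.
examples = {
    "phytochemistry": "terpene emission from conifer needles and resin",
    "phylogenetics": "phylogenetic tree of species inferred from gene sequences",
}
label = classify("monoterpene and terpene emission measured from pine needles", examples)
print(label)
```

In practice the example texts would be replaced by word counts aggregated over papers already known to belong to each class.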
(Bio)chemical transformations / Phylogenetics
A. Diagrams of chemical and biochemical reactions can be automatically extracted from PDFs into the researcher's filestore.
B. Phylogenetic trees can be automatically extracted from bitmap diagrams or PDFs, and species names verified. Mounce, Murray-Rust, Wills: http://doi.org/10.3897/rio.3.e13589
Tables and graphs
C. Tables and graphs can be automatically extracted into the researcher's filestore and turned into CSV tables or spectra. Designed for re-use with your favourite tools (R, Python, etc.)
INTELLIGENT QUERIES
INTELLIGENT CONTENT