Text Mining: The Next Data Frontier. Beyond Open Access
#openminted_eu
beyond Open Access
Text Mining: the next data frontier
Natalia Manola, Athena Research & Innovation Centre
OpenCon Satellite Berlin, 25 Nov 2016
A few sobering facts on content production
● 1.8 billion websites and 3.46 billion internet users as of 25 September 2016.
● 24 million wireless sensors and actuators worldwide (up 553% between 2011 and 2016).
● 16 zettabytes of useful data (16 trillion GB) by 2020.
● YouTube claims that 24 hours of video are uploaded every minute, making the site a hugely significant data aggregator.
● Every second, on average, around 6,000 tweets are sent on Twitter, which corresponds to over 350,000 tweets per minute, more than 500 million tweets per day and around 200 billion tweets per year.
● 74,200,000 pages existed on Facebook, with 7 million apps and websites integrated with Facebook, as of 30 May 2016.
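The Twitter figures above can be cross-checked with simple arithmetic. The following is a minimal sketch, assuming a constant average of 6,000 tweets per second and a 365-day year (both simplifications; the variable names are illustrative and not from the talk):

```python
# Sanity check of the tweet-rate figures quoted on the slide.
# Assumption: a constant average rate of 6,000 tweets per second.
TWEETS_PER_SECOND = 6_000

per_minute = TWEETS_PER_SECOND * 60
per_day = per_minute * 60 * 24
per_year = per_day * 365  # assumption: 365-day year

print(f"per minute: {per_minute:,}")
print(f"per day:    {per_day:,}")
print(f"per year:   {per_year:,}")
```

Under these assumptions the rate works out to 360,000 tweets per minute, roughly 518 million per day and roughly 189 billion per year, which is consistent with the rounded figures on the slide.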
… And some facts on scientific literature
The global research community generates ~2.5 million new scholarly articles per year (English only)
The STM report (2015)
… some 90% of papers … are never cited (82% in the humanities)
… of those articles that are cited, only 20 percent have actually been read
… 50% of papers are never read by anyone other than their authors, referees and journal editors
Lokman I. Meho, The rise and rise of citation analysis, 2007
… one paper published every 12 seconds
… 70,000 papers published on a single protein, the tumor suppressor p53
Spangler et al., Automated Hypothesis Generation based on Mining Scientific Literature, 2014
How can we make sense of this data?
Emerging solutions
Machine reading: process textual sources, organise and classify them along various dimensions, extract the main (indexical) information items
… and “understanding”: identify and extract entities and relations between entities, facilitate the transformation of unstructured textual sources into structured data
… and predicting: enable the multidimensional analysis of structured data to extract meaningful insights and improve the ability to predict
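As a rough illustration of the "understanding" step (entities and relations turned into structured data), here is a toy sketch. It is not a method from the talk: the gazetteer, the trigger verbs, the `Relation` record and the example sentences are all hypothetical, and real text mining systems use far richer linguistic and statistical models than a regular expression.

```python
import re
from dataclasses import dataclass

# Toy gazetteer of entity surface forms and types (hypothetical, illustration only).
GAZETTEER = {
    "p53": "PROTEIN",
    "MDM2": "PROTEIN",
    "apoptosis": "PROCESS",
}

# Hypothetical relation trigger verbs.
TRIGGERS = ("inhibits", "induces", "binds")


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str


def extract_entities(sentence):
    """Return (surface form, type) pairs for gazetteer terms found in the sentence."""
    found = []
    for term, etype in GAZETTEER.items():
        if re.search(rf"\b{re.escape(term)}\b", sentence):
            found.append((term, etype))
    return found


def extract_relations(sentence):
    """Match 'ENTITY trigger ENTITY' patterns and return them as structured records."""
    entity = "|".join(re.escape(t) for t in GAZETTEER)
    trigger = "|".join(TRIGGERS)
    pattern = re.compile(rf"\b({entity})\s+({trigger})\s+({entity})\b")
    return [Relation(*m.groups()) for m in pattern.finditer(sentence)]


# Unstructured input text (toy sentences).
text = "MDM2 inhibits p53. In many cell types p53 induces apoptosis."
sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

# Structured output: one record per extracted relation.
relations = [r for s in sentences for r in extract_relations(s)]
for r in relations:
    print(r)
```

The resulting records can be loaded into a table or graph, which is what makes the later "predicting" step (analysis over structured data) possible.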
However, … a multitude of solutions catering for different
Text types: newswire, scientific literature, tweets/blogs, patents, clinical/medical records, textbooks, monographs, online forums, …
Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …
Tasks: translation, information extraction, semantic search, question answering, sentiment analysis, summarization, knowledge discovery, …
Domains: finance/business, health, biology, social sciences, humanities, …
Creating a fragmented landscape
A glimpse of the TDM landscape
Resource: FutureTDM project (www.futuretdm.eu)
What can we do?
1. Share content
• Document literature content
• Share in a meaningful way: what does Open Access really mean?
IPR and licensing
• Study IPR restrictions for reuse of sources as well as possible exceptions
• Promote clarity and standardisation of legal rights and obligations
Challenges
• Rights statements vs. open licenses (for repositories)
• No access to full text: we live in a metadata world
• No standard protocols, formats and APIs for access and retrieval
• No capacity to handle extra traffic
Proposed solution: make TDM-enabled hubs
[Diagram labels: literature repositories, OA journals, data repositories, aggregators, archives; metadata, full text, data; OpenAIRE, CORE, PMC Europe, …; guidelines, APIs; TDM; research networks, WikiPedia/Media/Research, …]
2. Share TDM services
• Document language processing/text mining services and workflows in a meaningful way for domain discipline researchers
• Document language/knowledge resources, data categories, taxonomies, provenance information
Interoperable services
• Common way of presenting annotated results
• Combine services into workflows
• Combine content and language resources with services and workflows
• Combine automatic and manual/crowdsourcing annotation services
IPR and licensing
• Translate the legal & policy aspects into specifications for lawful user-to-service and service-to-service interactions
Challenges
• Bring text miners close to the researchers' problems and needs
• Semantic interoperability (not just technical)
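To make "a common way of presenting annotated results" and "combine services into workflows" concrete, here is a minimal sketch. It assumes stand-off annotations (character offsets plus a label and a provenance field) as the shared format. All names here (`Annotation`, `Document`, `tokenizer`, `acronym_tagger`, `run_workflow`) are hypothetical and do not represent the OpenMinTeD API.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    start: int    # character offset where the span begins
    end: int      # character offset just past the span
    label: str    # annotation type, e.g. "Token"
    source: str   # component that produced it (provenance)


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# Every service shares one signature, so services can be chained freely.
Component = Callable[[Document], Document]


def tokenizer(doc: Document) -> Document:
    """Add one 'Token' annotation per run of word characters."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation(m.start(), m.end(), "Token", "tokenizer"))
    return doc


def acronym_tagger(doc: Document) -> Document:
    """Mark all-uppercase tokens of length >= 2 as acronyms, reusing tokenizer output."""
    tokens = [a for a in doc.annotations if a.label == "Token"]
    for tok in tokens:
        word = doc.text[tok.start:tok.end]
        if len(word) >= 2 and word.isupper():
            doc.annotations.append(
                Annotation(tok.start, tok.end, "Acronym", "acronym_tagger")
            )
    return doc


def run_workflow(doc: Document, components: List[Component]) -> Document:
    """Apply each component in order; each one reads and extends the shared annotations."""
    for component in components:
        doc = component(doc)
    return doc


doc = run_workflow(Document("TDM services need common APIs"), [tokenizer, acronym_tagger])
acronyms = [doc.text[a.start:a.end] for a in doc.annotations if a.label == "Acronym"]
print(acronyms)
```

Because the second component only depends on the shared annotation format and not on the first component's internals, either one could be swapped for another service that produces the same annotation types. Agreeing on what a label such as "Token" means across tools is the semantic interoperability challenge listed above.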
OpenMinted
Establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can discover, collaboratively create, share and re-use knowledge from a wide range of text-based scientific and scholarly related sources.
A step from Open Access to Open Science
HIGH LEVEL ARCHITECTURE
Our Services
• Policies & guidelines
• Register and discover TDM services and tools
• Link to content hubs
• Run a TDM job and share results
• Get people’s knowledge: crowdsourced annotation
• Build your own service: combine components into a workflow and share
Our Users
End users
• Researchers, database curators, research infrastructure operators
• Novice: use services to advance their science
• Advanced: combine TDM components into complex workflows
Content and service providers
- Publishers, libraries, scientific database centres, …
- TDM researchers
- SMEs
Community Driven
Scholarly Comm.: feature extraction, data citation, research analytics
Life Sciences: curation of databases and lexica in Chembolomics & neuroinformatics
Agriculture: extracting information from tables for food safety alerts
Social Sciences: data citation
From the very beginning … requirements, content, barriers, expected outcomes.
… to the very end: create applications, validate and evaluate the results.
Examples of OpenAIRE TDM services we want to share
@openaire_eu
Discover research in context
Research Trends and correlations
Text and data mining with domain specific knowledge
Interactive visualization for drill-down information
…
Trends in science
Correlations of funding programmes
Within a funder, or across countries
What will it look like?
The OpenMinted registry
Browse TDM resources & tools/services
Register, document, share tools
Create your corpus, annotate, share
How does this all bind together?
How does Open Science help?
[Diagram labels: OpenAIRE, CORE, CrossRef, …; OpenMinted REGISTRY; CLARIN, META-SHARE; OpenMinted WORKFLOWS; TDM TOOLS; repositories, (OA) journals, other textual resources (e.g. medical records, PSI); language resources, …]
What’s next
Participate with your ideas
• Give us your feedback on our pending guidelines and APIs
• Provide us with your TDM requirements: we have the experts to advise you
• Register your TDM services
• Test out the system when it goes live (spring)
Watch out for
• OpenAIRE’s datathons, tenders and challenges (60K in total)
• OpenMinTeD’s tenders and challenges (240K in total)
twitter.com/openminted_eu
facebook.com/openminted
bit.do/openmintedlinkedin
vimeo.com/openminted
bit.do/openmintedplus
THANK YOU!
Natalia Manola