The Role of “Big Data” in Scientific Publishing
Bradley P. Allen, Chief Architect, Elsevier. Presentation for panel on “Giving Voice to Content: Emerging Technologies”, NFAIS 56th Annual Conference, Philadelphia, PA, USA, 2014-02-24.
1
The Role of “Big Data” in Scientific Publishing
Bradley P. Allen
Chief Architect, Elsevier
Presentation for panel on “Giving Voice to Content: Emerging Technologies”
NFAIS 56th Annual Conference
Philadelphia, PA, USA
2014-02-24
2
Reference: http://ajharmony.tumblr.com/post/65901268958/mostlysignssomeportents-big-data-is-like, from a quote by Dan Ariely in https://www.facebook.com/dan.ariely/posts/904383595868
Why the scare quotes?
3
How large is the amount of data your organization currently manages to produce its online products and services?
1. Gigabytes
2. Terabytes
3. Petabytes
4. Exabytes
Audience poll: current data scales
4
Scientific content in the context of big data
5
• Scientific publishing is the act of compressing a universe’s worth of data into small pieces of content that people can consume
• In essence, this is the ultimate big data problem
• But it is one in which until recently publishers have played a very simple role
• That is beginning to change
What does big data mean to scientific publishing?
6
• Create more useful content by enhancing it with data extracted from content
• Make the researcher’s life better by exploiting data about how content is used to improve her experience of using our online applications
• Enable research itself by supporting the care and feeding of experimental data at scale
What are we beginning to do with big data?
7
Which of these uses of big data is most important for your organization?
1. Extracting data from content
2. Improving user experience through usage analytics
3. Managing experimental data
4. All of the above
5. None of the above
Audience poll: big data use cases
8
Sources of data in scientific publishing (type of data, with inputs, outputs and benefits)
• Data extracted from content
– Inputs: XML, long-form free text, short-form free text, tables, images, video, audio
– Outputs: asset metadata, citations, classifications, clusters, entities, relations, language models, probabilistic graphical models
– Benefits: advances scientific understanding; provides publishers with raw material for linking content with task-specific solutions
• Data about how content is used
– Inputs: article views, search queries, user behavior, social media streams
– Outputs: article-level metrics, sentiment analysis, ranking and impact metrics, user interest profiles, collaborative filtering models
– Benefits: provides the researcher insight about her career; provides institutions data about their performance and impact; provides publishers with data for optimizing content delivery
• Experimental data
– Inputs: sensor and instrumentation feeds, crowdsourced data (e.g. user surveys)
– Outputs: data records, curated datasets
– Benefits: provides input to research analytics; provides archival management of research data assets
9
Example: collaborative filtering in ScienceDirect
• When users look at articles on ScienceDirect, they are provided links to other articles of interest
• Related Articles was originally implemented using bag-of-words similarity via a search engine query
• Goal: increase click-through rate on Recommended Articles over the previous Related Articles offering; drive usage, engagement & revenue
• Pilot: ran from March to July 2013, with 9 variants A/B tested on ~5% of SD traffic
• Production: since Aug 2013
Inputs
• 5 years of SD usage data/events
• All SD XML articles
• SNIP2 journal rankings
[Diagram labels: Thor; Roxie; co-download matrix; similarity; attribute ranking; 6 billion events; ~12M articles; daily updates; pii-739156; pii-684259, pii_585346, pii_491635]
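The slide names a co-download matrix and a similarity step but does not say how similarity is computed. As a rough illustration only, the sketch below assumes cosine similarity over session-level co-downloads; the function names, the session format and the `pii-A` style identifiers are hypothetical, and the attribute-ranking step (for example by SNIP2) is not modeled.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def co_download_counts(sessions):
    """Count how many sessions contain each article and each article pair."""
    pair_counts = defaultdict(int)
    article_counts = defaultdict(int)
    for articles in sessions:
        unique = sorted(set(articles))  # ignore repeat downloads in a session
        for a in unique:
            article_counts[a] += 1
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts, article_counts


def recommend(article, sessions, top_n=3):
    """Rank other articles by cosine similarity of co-download behavior.

    Assumption: similarity = co-downloads / sqrt(downloads_a * downloads_b).
    """
    pair_counts, article_counts = co_download_counts(sessions)
    scores = {}
    for (a, b), n in pair_counts.items():
        if article not in (a, b):
            continue
        other = b if a == article else a
        scores[other] = n / sqrt(article_counts[a] * article_counts[b])
    # Highest score first; ties broken by identifier for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical sessions: each list is the set of articles one user downloaded.
sessions = [
    ["pii-A", "pii-B", "pii-C"],
    ["pii-A", "pii-B"],
    ["pii-A", "pii-D"],
    ["pii-C", "pii-D"],
]
print(recommend("pii-A", sessions))
```

At the scale quoted on the slide (billions of events, millions of articles) the pair counting would be distributed across a cluster rather than run in one process; the logic per pair stays the same.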
10
Which big data tools/platforms are you currently using?
1. Apache Hadoop
2. A Hadoop distribution (Cloudera, MapR, Amazon EMR, …)
3. LexisNexis HPCC
4. Twitter Storm
5. Rolling our own
6. None of the above
Audience poll: big data tools and platforms
11
• All of these tools and platforms basically make the following easy to do
– Break data up into many chunks, each of which can fit into memory on a given machine
– Send each chunk to a machine where it is processed into chunks containing intermediate results
– Combine the intermediate results into a single aggregate data set
– Lather, rinse, repeat…
How big data infrastructure works
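The chunk, process and combine steps on this slide can be sketched in a few lines. This is a single-process illustration, not how Hadoop or HPCC is actually programmed: the function names are hypothetical, and word counting stands in for whatever per-chunk processing a real job would do.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the data into chunks small enough to process independently."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Turn one chunk into an intermediate result (here: word counts)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def combine(left, right):
    """Merge two intermediate results into one aggregate."""
    return left + right


def run_job(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # On a cluster each process_chunk call would run on a different machine.
    intermediate = [process_chunk(chunk) for chunk in chunks]
    return reduce(combine, intermediate, Counter())


records = ["big data", "data at scale", "big science", "data"]
totals = run_job(records)
print(totals["data"], totals["big"])
```

Because `combine` is associative, intermediate results can be merged in any grouping, which is what lets the platforms spread the work across machines and repeat the cycle on the aggregated output.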
12
Big data technologies within Elsevier
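(Type of processing, with timeframe, data volume, key platforms and projects/products)
• Batch
– Timeframe: minutes to hours
– Data volume: TBs to PBs
– Key platforms: HPCC Thor, Hadoop
– Projects/products: SciVal Spotlight, Scopus author profile deduplication, ScienceDirect related articles recommendation
• Stream
– Timeframe: neverending
– Data volume: unbounded and continuous
– Key platforms: HPCC Roxie, Twitter Storm
– Projects/products: internal content analytics and text mining tools
• Ad-hoc query
– Timeframe: milliseconds to minutes
– Data volume: GBs to PBs
– Key platforms: HPCC Roxie
– Projects/products: ScienceDirect usage analytics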
13
• Talent acquisition
– What training is needed to make big data platforms usable by our existing teams?
– Who/what is a data scientist?
• Best practices and design patterns for big data
– @nathanmarz’ Lambda Architecture
• The proliferation of big data platforms
– HPCC, MapR, Cloudera…
• Cloud-based vs. hosted solutions
– Amazon Elastic MapReduce, Redshift
• Data formats and practice for scaling ETL/ELT
– Apache Avro, Google Protocol Buffers, zlib-compressed JSON
• Numerical computing frameworks for optimization
– High-performance computing using GPUs
Big data technology issues (in no particular order)
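Of the data formats listed above, zlib-compressed JSON is the simplest to show with only a standard library. The sketch below is a minimal round trip under assumed conditions: the event records and the `pack`/`unpack` helper names are hypothetical, and nothing here reflects how Elsevier actually stores its usage data.

```python
import json
import zlib


def pack(record):
    """Serialize to JSON, then compress the UTF-8 bytes with zlib."""
    return zlib.compress(json.dumps(record, sort_keys=True).encode("utf-8"))


def unpack(blob):
    """Reverse of pack: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical usage events; repetitive keys are what make JSON compress well.
events = [{"event": "download", "article": "pii-A", "session": i} for i in range(1000)]

raw = json.dumps(events, sort_keys=True).encode("utf-8")
blob = pack(events)

assert unpack(blob) == events  # lossless round trip
print(len(raw), len(blob))     # compressed size is much smaller for data like this
```

Schema-based formats such as Avro and Protocol Buffers avoid repeating field names in every record, which is one reason they appear alongside compressed JSON in the list.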
14
• These technologies can yield a wealth of infrastructure, tools, workflows and business models to clone and adapt to the special circumstances of scientific publishing
• Big data can open the door to optimizing the value exchange between author, publisher and reader
• This will require us to walk away from legacy preconceptions
– Ask yourself: is it this way because it was done on paper?
• A thought experiment: gold open access as computational advertising
Can we use big data to enable new business models?
15
Big data is key to computational advertising
Reference: S. Yuan, A.Z. Abidin, M. Sloan and J. Wang. Internet Advertising: An Interplay among Advertisers, Online Publishers, Ad Exchanges and Web Users. arXiv:1206.1754v1 [cs.IR] 8 Jun 2012.
16
Can big data enable computational publishing?
[Figure labels: Authors, Researchers, Publishers, Article exchanges; knowledge, article inventories, credit, time & focus; $$$, $$, ($)]
The simplified ecosystem of author-pays scientific publishing. Authors spend budget to buy article inventories from article exchanges and publishers; article exchanges serve as matchers for articles and journals; publishers provide valuable information to satisfy and keep researchers; researchers read articles and exchange credit for knowledge from the authors. Note that normally researchers would not receive cash from publishers.
17
• Big data can play a role in creating new value for researchers and institutions
• Ways in which big data is currently exploited in the consumer Internet provide guidance for its use by scientific publishers
Summary