machines are people too
TRANSCRIPT
MACHINES ARE PEOPLE TOO
Dr. Paul Groth | @pgroth | pgroth.com
Disruptive Technology Director
Elsevier Labs | @elsevierlabs
Theory and Practice of Digital Libraries 2017
THANKS FOR CONVERSATION & SLIDES!
Riffing off of Brad’s Dublin Core 2016 keynote
https://www.slideshare.net/bpa777/dc2016-keynote-20161013-67164305
THE SUCCESS OF DIGITAL LIBRARIES
“Live every day like it's NBER day”
THE NEXT MEDIA: DATA
FAIR EVERYWHERE
RESEARCH DATA MANAGEMENT
DATA SEARCH
Antony Scerri, John Kuriakose, Amit Ajit Deshmane, Mark Stanger, Peter Cotroneo, Rebekah Moore, Raj Naik, Anita de Waard;
Elsevier’s approach to the bioCADDIE 2016 Dataset Retrieval Challenge, Database, Volume 2017, 1 January 2017,
bax056, https://doi.org/10.1093/database/bax056
THE CENTRALITY OF THE USER
HOW DO RESEARCHERS SEARCH FOR DATA?
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A.,
& Wyatt, S. (2017). Searching Data: A Review of
Observational Data Retrieval Practices. arXiv
preprint arXiv:1707.06937.
Some observations from @gregory_km’s survey:
1. The needs and behaviours of specific user groups
(e.g. early career researchers, policy makers,
students) are not well documented.
2. Background uses of observational data are better
documented than foreground uses.
3. Reconstructing data tables from journal articles,
using general search engines, and making direct data
requests are common.
BUT ARE WE MISSING A USER?
WHY MACHINES?
ELSEVIER’S BUSINESS: PROVIDING ANSWERS FOR RESEARCHERS, DOCTORS AND NURSES
My work is moving towards a new field; what should I know?
• Journal articles, reference works, profiles of researchers, funders & institutions
• Recommendations of people to connect with, reading lists, topic pages
How should I treat my patient given her condition & history?
• Journal articles, reference works, medical guidelines, electronic health records
• Treatment plan with alternatives personalized for the patient
How can I master the subject matter of the course I am taking?
• Course syllabus, reference works, course objectives, student history
• Quiz plan based on the student’s history and course objectives
INFORMATION OVERLOAD
WHAT CAN MACHINE INTELLIGENCE DO TODAY?
If there’s a task that a normal person can do with
less than one second of thinking, there’s a very
good chance we can automate it with deep
learning.
Andrew Ng, Chief Scientist, Baidu (lecture at Bay Area Deep Learning
School, Stanford, CA, September 24, 2016)
HUMAN SPEECH RECOGNITION
Word error rate was 23% in 2013, and over 35% in 2012; the article below reports 4.9% in 2017.
https://venturebeat.com/2017/05/17/googles-speech-recognition-technology-now-has-a-4-9-word-error-rate/
IMAGE RECOGNITION
https://devblogs.nvidia.com/parallelforall/author/czhang/
THESE RESULTS ARE DRIVEN BY DATA
“The paradigm shift of the ImageNet
thinking is that while a lot of people
are paying attention to models, let’s
pay attention to data, …”
– Prof. Fei-Fei Li [1]
[1] The data that transformed AI research—and possibly the world
https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
THE GROWTH IN DATA ENGINEERS
https://www.stitchdata.com/resources/reports/the-state-of-data-engineering
BUT DO DIGITAL LIBRARIES HELP MACHINES?
• Machines’ proficiency in learning to answer questions from text, audio, images and video will depend on our ability to train them effectively to read information from the Web
• How machines read the Web today
  • Crawling and indexing Web resources, possibly semantically tagged (e.g. using schema.org)
  • Find-and-follow crawling of open linked data resources for ontology and data sharing and reuse
  • Programmatic access to APIs mediated through HTTP/S and other Internet protocols
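A minimal sketch of the first reading mode above, assuming a page that embeds schema.org metadata as JSON-LD. The page content, dataset name and URL are hypothetical, and `extract_jsonld` is an illustrative helper, not part of any crawler mentioned in the talk.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Hypothetical landing page; the name and URL are made up for illustration.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example observations", "url": "https://example.org/data/1"}
</script>
</head><body><p>Human-readable landing page.</p></body></html>
"""


def extract_jsonld(html):
    """Return every JSON-LD object embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


records = extract_jsonld(PAGE)
print(records[0]["@type"], "-", records[0]["name"])
```

The machine never looks at the paragraph a person would read; it only sees the tagged block, which is why the quality of that metadata matters.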
DIGITAL LIBRARIES & LINKED DATA STANDARDS
THE SEMANTIC WEB WAS INTENDED FOR MACHINE READING
… that’s the real idea behind the Semantic Web:
letting software use the vast collective genius
embedded in its published pages.
Swartz, A. (2013). Aaron Swartz's A programmable Web: An unfinished
work. San Rafael, Calif.: Morgan & Claypool Publishers.
BUT THE SEMANTIC WEB IS BUILT FOR PEOPLE, NOT MACHINES
• The Semantic Web is largely a logicist take on the way knowledge is to be
represented
• The latest advances in machine intelligence are based on a connectionist
approach to knowledge representation
• There is a gap between how knowledge is represented in the Semantic Web
and what deep learning is exploiting to such good effect
• The Semantic Web is silent about how machines can become better
readers, and hence better partners in the second machine age
• How will we evolve metadata standards to better accommodate machines?
MACHINE READING IS ENABLED BY MACHINE LEARNING
[Diagram contrasting programming with machine learning. Labels: input, output, algorithm, model, learning architecture, data, CPU, GPU.]
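A toy sketch of the programming versus machine learning contrast, under loud assumptions: ordinary least squares stands in for the learning architecture (it is not deep learning), and the temperature-conversion task and all function names are hypothetical.

```python
def celsius_to_fahrenheit(c):
    """Programming: a person writes the algorithm that maps input to output."""
    return 1.8 * c + 32.0


def fit_line(xs, ys):
    """Machine learning in miniature: estimate slope and intercept
    from example input/output pairs by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_model(slope, intercept):
    """The learned model is just data (two numbers) wrapped in a function."""
    return lambda x: slope * x + intercept


# Training data: input/output pairs (generated here from the hand-written rule).
inputs = [-40.0, 0.0, 37.0, 100.0]
outputs = [celsius_to_fahrenheit(c) for c in inputs]

slope, intercept = fit_line(inputs, outputs)
model = make_model(slope, intercept)
print(round(model(20.0), 6), round(celsius_to_fahrenheit(20.0), 6))
```

In the first case the knowledge lives in the code; in the second it lives in the examples, which is the part a library can curate.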
MACHINES SEE THINGS DIFFERENTLY THAN PEOPLE
From: Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644v1.
MACHINES LEARN THINGS DIFFERENTLY THAN PEOPLE
VOCABULARIES ARE SETS OF VECTOR EMBEDDINGS
From: Eisner, B., Rocktäschel, T., Augenstein, I., Bošnjak, M. and Riedel, S. (2016). Emoji2vec: learning emoji representations from their description. arXiv:1609.08359v1.
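A minimal sketch of what "a vocabulary as a set of vector embeddings" can look like, assuming made-up 3-dimensional vectors. Real embeddings are learned from data and have far more dimensions; the terms and values below are hypothetical.

```python
import math

# Hypothetical embeddings: each vocabulary term is a point in vector space.
EMBEDDINGS = {
    "library": [0.9, 0.1, 0.0],
    "archive": [0.8, 0.2, 0.1],
    "dataset": [0.2, 0.9, 0.1],
    "model": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(term, embeddings):
    """Return the other term whose vector is most similar to `term`."""
    target = embeddings[term]
    others = (t for t in embeddings if t != term)
    return max(others, key=lambda t: cosine(target, embeddings[t]))


print(nearest("library", EMBEDDINGS))
print(nearest("dataset", EMBEDDINGS))
```

Relatedness here is a distance computation rather than an asserted link between terms, which is the gap with logic-based vocabularies that the talk points at.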
TRAINING DATASETS ARE GROWING IN VOLUME AND COVERAGE
From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B. and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675.
MODELS ARE BECOMING REUSABLE DATA RESOURCES
Check out: sujitpal.blogspot.com for more
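One way to picture a model as a reusable data resource, as a sketch only: the parameters travel as plain data alongside descriptive metadata, and any consumer can reload and apply them. The record layout, model name and field names are hypothetical, not a published format.

```python
import json


def predict(params, features):
    """Apply a linear model whose parameters are stored as plain data."""
    weighted = sum(w * x for w, x in zip(params["weights"], features))
    return weighted + params["bias"]


# Hypothetical published record: parameters plus descriptive metadata.
published = {
    "name": "toy-linear-scorer",
    "trained_on": "hypothetical-dataset-v1",
    "params": {"weights": [0.5, -0.25], "bias": 1.0},
}

serialized = json.dumps(published)  # what a publisher might expose
loaded = json.loads(serialized)     # what a consumer might download

print(loaded["name"], predict(loaded["params"], [2.0, 4.0]))
```

Once a model is just a file with metadata, it can be described, versioned and cited like any other item in a collection.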
MACHINE LEARNING DATASETS AND MODELS ARE BECOMING PART OF THE WEB
• Machines need lots and lots of data to learn how to read
• Datasets with ad-hoc formats are being made openly available
  • Open Images: “~9 million URLs to images that have been annotated with labels spanning over 6000 categories” (The Open Images Dataset. (n.d.). Retrieved September 29, 2016, from https://github.com/openimages/dataset.)
  • YouTube-8M: “8 million YouTube video URLs (representing over 500,000 hours of video), along with video-level labels from a diverse set of 4800 Knowledge Graph entities” (Vijayanarasimhan, S. and Natsev, P. (2016). Announcing YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research. Retrieved September 29, 2016, from https://research.googleblog.com/2016/09/announcing-youtube-8m-large-and-diverse.html.)
  • Stanford Natural Language Inference: “570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference” (The Stanford Natural Language Inference (SNLI) Corpus. (n.d.). Retrieved September 29, 2016, from http://nlp.stanford.edu/projects/snli/.)
• Standard architectures for machine (deep) learning are being released as open source
  • Dense neural networks for classification
  • Convolutional neural networks for image, audio and video recognition
  • Recurrent neural networks for sequence processing and generation
• Advances in the field are being published quickly and transferred to industrial application just as quickly
THE OPPORTUNITY FOR LIBRARIANS AND PUBLISHERS
As machines become increasingly capable of general-
purpose language understanding, the burden of effort in
building machine intelligences will shift from software
engineering to the acquisition, organization and curation
of training content and data.
THE ROLE OF METADATA IN THE SECOND MACHINE AGE – DC-2016 / KØBENHAVN / 13 OCTOBER
SAVE THE TIME OF THE MACHINE READER
Perhaps this law is not so self-evident as the others.
None the less, it has been responsible for many
reforms in library administration and has a great
potentiality for effecting many more reforms in the
future.
Ranganathan, S.R. (1931). The five laws of library science. Madras: The
Madras Library Association.
Image source: http://westportlibrary.org/about/news/robots-arrive-westport-library
WHAT DOES IT LOOK LIKE TO HAVE MACHINES AS
LIBRARY PATRONS?
Tasks
1. Dataset / Model / Vocabulary Curation
2. Combating Bias
3. Explanation
4. Interoperability
5. Data Narratives
DATASET CURATION
MODEL CURATION
VOCABULARY CURATION
BATTLING BIAS
BATTLING BIAS: ALGORITHMIC LITERACY
“Algorithms all have their own ideologies. As computational
methods and data science become more and more a part of
every aspect of our lives, it is essential that work begin to ensure
there is a broader literacy about these techniques and that
there is an expansive and deep engagement in the ethical
issues surrounding them.”
– Trevor Owens (Library of Congress / Former IMLS)
http://www.pewinternet.org/2017/02/08/theme-7-the-need-grows-for-algorithmic-literacy-transparency-and-oversight/
THE RIGHT TO AN EXPLANATION
“The data subject shall have the right to obtain … the
existence of automated decision-making, including profiling
… meaningful information about the logic involved, as
well as the significance and the envisaged consequences
of such processing for the data subject.”
EU General Data Protection Regulation, Chapter 3, Article 15
PROVENANCE FOR EXPLANATION
Credits: Curt Tilmes, Peter Fox
Tilmes, C.; Fox, P.; Ma, X.; McGuinness, D.L.; Privette, A.P.; Smith, A.; Waple, A.; Zednik, S.; Zheng, J.G.,
"Provenance Representation for the National Climate Assessment in the Global Change Information System,"
IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 11, pp. 5160-5168, Nov. 2013
NATIONAL CLIMATE CHANGE ASSESSMENT
PROVENANCE
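A small sketch of how provenance can support explanation: if each derived item records what it was derived from, a machine can answer "where did this finding come from?" by walking the links. All identifiers below are hypothetical, and the `was_derived_from` key is only modelled on the W3C PROV relation wasDerivedFrom; this is not the representation used in the cited paper.

```python
# Hypothetical provenance records: each item lists its direct sources.
PROVENANCE = {
    "report:finding-1": {"was_derived_from": ["figure:2.1"]},
    "figure:2.1": {"was_derived_from": ["dataset:temps-v3", "dataset:stations"]},
    "dataset:temps-v3": {"was_derived_from": ["dataset:temps-raw"]},
    "dataset:stations": {"was_derived_from": []},
    "dataset:temps-raw": {"was_derived_from": []},
}


def lineage(item, provenance):
    """Return every upstream source of `item`, depth-first, without duplicates."""
    seen = []
    stack = list(provenance.get(item, {}).get("was_derived_from", []))
    while stack:
        current = stack.pop(0)
        if current in seen:
            continue
        seen.append(current)
        # Visit this source's own sources before the remaining siblings.
        parents = provenance.get(current, {}).get("was_derived_from", [])
        stack = list(parents) + stack
    return seen


for source in lineage("report:finding-1", PROVENANCE):
    print(source)
```

The same trace serves a human reader asking for an explanation and a machine reader deciding whether to trust an input.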
INTEROPERABILITY
DATA NARRATIVE GENERATION
Towards Automating Data Narratives.
Gil, Y.; and Garijo, D. In Proceedings of the
Twenty-Second ACM International Conference
on Intelligent User Interfaces (IUI-17),
Limassol, Cyprus, 2017.
THE CHALLENGE: DIGITAL LIBRARIES FOR MACHINES
• Digital Libraries have made tremendous strides in making media available
• The investment in Linked Data and APIs has made integration and building
applications easier and can help machine reader use cases
• But a new user needs new support:
  • new forms of media (models, data)
  • new vocabulary representations
  • new forms of transparency
  • new ways to interoperate
  • new mechanisms to communicate
  • …
THANK YOU
Dr. Paul Groth | @pgroth | pgroth.com
labs.elsevier.com