bridging digital humanities research and big data repositories of digital text
DESCRIPTION
Keynote, 2014 Encuentro de Humanistas Digitales, Mexico City
TRANSCRIPT
Bridging Digital Humanities Research and Large Repositories of Digital Text
2nd Encuentro de Humanistas Digitales | 21.May.14 Biblioteca Vasconcelos, Mexico City
Beth Plale
Professor, School of Informatics and Computing; Director, Data To Insight Center
Indiana University
Tweet us - @HathiTrust #HTRC
HATHI TRUST RESEARCH CENTER!
Setting Stage
• “Informatics” is the application of computer and information science (CIS) to the data that constitutes the primary research material of a given field.
• In Europe, digital humanities is sometimes called “cultural informatics”, but that misses the point that an informatics researcher brings CIS methodologies to problems in the humanities, whereas DH researchers bring humanities methodologies to problems.
• I am an informatics researcher (CIS methodologies) with a 15-year record in geo-informatics and, over the last 5 years, a growing understanding of the methodology and motivations of the digital humanities researcher.
Digital humanities is an emerging discipline that applies computation to research in the humanities. More than simply conducting research with computers, digital humanities scholars use information technology as a central part of their methodology.
University of Illinois Library web site, 2014
Digital Humanities activities categorized
• Access: big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc. Pretty much everything on that list is being digitized in very large numbers.
• Production: we're already seeing more and more scholars producing their work for the Web. It might take the form of scholarly websites, blogs, wikis, or whatever. […] the entire production cycle uses technology (collecting, editing, discussing with others) before the final product is created.
• Consumption: people get their materials in all kinds of new ways. Reading has changed with the Web. The way we read is changing. Bits and pieces of varied content from so many places and perspectives.
Interview with Brett Bobley, NEH, 2009 http://www.hastac.org/node/1934
Why does it matter?
“If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
2009 interview with Brett Bobley, National Endowment for the Humanities, US, on predictions for the future of Digital Humanities
Bobley’s Prediction, cont.
In a world of big, massive scale, he asks:
• “How might quantitative technology-based methodologies like data mining help you to better understand a giant corpus? Help you zero in on issues?”
• “What if you are a historian and you now have access to every newspaper around the world?”
• “How might searching and mining that kind of dataset radically change your results?”
Goal of Talk
Introduce technical architectural big data developments around HathiTrust and emerging examples of use, to facilitate discussion around whether Brett Bobley’s 2009 prediction of “scale. Big, massive, scale”, which is here today, can now deliver on advances for digital humanities.
#HTRC @HathiTrust
HathiTrust
• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
  – Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
http://www.hathitrust.org/htrc
http://www.hathitrust.org
→ Distinguished from
Content of HathiTrust
• Books and journals
  – Plus pilots around images, audio, born-digital
• Digitization sources
  – Google (96.8%, 10,162,104)
  – Internet Archive (2.9%, 301,972)
  – Local (0.3%, 31,840)
Content Sources
Content distribution
360,000 volumes in Spanish
Motivation for HTRC
→ HathiTrust repository is massive scale: a latent goldmine for text-based research
→ Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
→ Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)
HathiTrust Research Center
• The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.
• HTRC Executive Committee
  – Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
  – J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
  – Robert McDonald, Indiana University Libraries
  – Beth Namachchivaya Sandore, University of Illinois Library
  – John Unsworth, CIO, Dean of Library, Brandeis University
HTRC system
(Diagram labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots)
Return to categories of DH activity
HTRC in current form best at supporting:
• Access: by narrowing down to essential materials quickly, separating wheat from chaff. “big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc.”
• Production: by supporting computational investigation over a massive scale of texts that will require large-scale computers (cloud computing)
• Consumption: by tracking the bits and pieces (i.e., the HTRC workset). “The way we read is changing. Bits and pieces of varied content from so many places and perspectives.”
Interview with Brett Bobley, NEH, 2009
Workset manages engagement with texts
EXAMPLES OF RESEARCH THAT IS POSSIBLE AT SCALE
• Topic modeling
• Author Gender Identification
• Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
Topic Modeling
• Can answer more complex or nuanced questions
  – What are the primary themes of an author?
  – What are the primary themes of a research domain?
  – When did a new topic enter a research domain?
• Provides more data than word counts
  – 100s of topics can be extracted.
  – Underlying data (topics, volume, and page) is available
Themes for Authors
Two topics with identical centralities (e.g., Dickens) but separate themes
More strongly focused on book (illustrations, volume, literature)
More strongly focused on author himself (letters, household, house)
Ted Underwood, Univ of Illinois
GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University
Talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
Gender Identification of Text
• Question Investigated: Can we use author names in bibliographic records to identify gender?
• Looked at 2.6 million bibliographic records
  – Extracted personal author data
  – MARC 100 abcd and 700 abcd
• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on standard cataloging practice
  – Last name, first name middle name, titles/honorifics, dates
Authors vs Names
There is the author, then there are the names under which the author is published…
• Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
• Methuem, Algernon
• Methuen Algernon
• Methuen Marshall, Sir, bart., 1856-
• Methuen, A. Sir, 1856-1924
• Methuen, A. Sir, bart., 1856-1924
• Methuen Marshall, Sir bart 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, 1856-1924
• Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
• Methuen, Algernon, 1856-1924
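A minimal sketch of how variant strings like these might be split into the fields that standard cataloging practice implies (last name, given names, titles/honorifics, dates). The regular expression, the small honorific set, and the function name `parse_author` are illustrative assumptions, not the HTRC implementation.

```python
import re

# Illustrative subset of honorific tokens (assumption, not the project's list).
HONORIFICS = {"sir", "bart", "lady", "baron", "count"}
# Matches "1856-1924" as well as an open-ended "1856-".
DATE_RE = re.compile(r"\d{4}-(?:\d{4})?")

def parse_author(raw):
    """Split a catalog-style author string into rough fields."""
    match = DATE_RE.search(raw)
    dates = match.group(0) if match else None
    text = DATE_RE.sub("", raw)
    # Assumption: everything before the first comma is the last-name part.
    head, _, tail = text.partition(",")
    tail_tokens = re.sub(r"[.,]", " ", tail).split()
    titles = [t.lower() for t in tail_tokens if t.lower() in HONORIFICS]
    given = [t for t in tail_tokens if t.lower() not in HONORIFICS]
    return {"last": head.strip(), "given": given, "titles": titles, "dates": dates}

record = parse_author("Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924")
print(record)
```

Strings with no comma, such as "Methuen Algernon", end up entirely in the `last` field under this sketch, which illustrates why unfielded bibliographic data is harder to work with than fielded patent names.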
Sources of Data
• The Virtual International Authority File
  – Hosted by OCLC
• Harvested names from multiple data sources
  – Census bureau
  – Baby name sites
• EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
  – Developed an extensive list of European names
• Titles and honorifics
  – Multiple web resources
  – Sir, Baron, Count, Duke, Father, Cardinal, etc.
  – Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
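One way such sources could be combined is sketched below: check titles and honorifics first, then fall back to first-name lists. The title sets come from the slide; the tiny first-name sets, the lookup order, and the function name `guess_gender` are hypothetical stand-ins for the harvested data and whatever logic the project actually used.

```python
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}

# Toy first-name lists (assumption); real lists came from VIAF, web, and patent sources.
MALE_NAMES = {"algernon", "charles", "leslie"}
FEMALE_NAMES = {"jane", "mary", "leslie"}

def guess_gender(given_names, titles):
    """Return 'male', 'female', 'ambiguous', or 'unknown' for one name string."""
    titles = {t.lower().rstrip(".") for t in titles}
    has_male_title = bool(titles & MALE_TITLES)
    has_female_title = bool(titles & FEMALE_TITLES)
    # A title is treated as decisive only when it points one way.
    if has_male_title and not has_female_title:
        return "male"
    if has_female_title and not has_male_title:
        return "female"
    if not given_names:
        return "unknown"
    first = given_names[0].lower()
    in_male, in_female = first in MALE_NAMES, first in FEMALE_NAMES
    if in_male and in_female:
        return "ambiguous"
    if in_male:
        return "male"
    if in_female:
        return "female"
    return "unknown"

print(guess_gender(["Algernon"], ["Sir", "bart."]))
print(guess_gender(["A"], []))
```

An initial such as "A" matches no list and falls through to "unknown", which is one plausible reason a share of name strings stays unidentified.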
Initial Gender Results
• Approximately 80% of name strings have initial gender identification
  – Female: 59,365 (10%)
  – Male: 425,994 (70%)
  – Unknown: 114,204 (19%)
  – Ambiguous: 5,965 (less than 1%)
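The rounded percentages can be checked against the 606,437 unique personal author strings reported earlier. This is a small arithmetic sketch; dividing by that total is an assumption, since the four counts themselves sum to 605,528, slightly fewer than the reported number of strings.

```python
counts = {
    "female": 59_365,
    "male": 425_994,
    "unknown": 114_204,
    "ambiguous": 5_965,
}
TOTAL_STRINGS = 606_437  # unique personal author strings reported in the talk

# Share of each category as a percentage of all unique name strings.
shares = {label: round(100 * n / TOTAL_STRINGS, 1) for label, n in counts.items()}
identified = shares["female"] + shares["male"]

for label, pct in shares.items():
    print(f"{label:>9}: {pct}%")
print(f"identified (female + male): about {identified:.0f}%")
```

The female and male shares together come to roughly 80%, consistent with the headline figure on the slide.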
Results by Data Source
Against the whole set of name strings
• VIAF
  – 19% hit rate
• Web Names
  – 54% hit rate
• Patents Names
  – 8% hit rate
Colin Allen, Jamie Murdock
Cognitive Science, Indiana University
Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013
Digging into philosophy of science
• Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
• Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
• Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
The How
• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
• Set contains lots of uninteresting books: e.g., college course catalogs
• Apply topic modeling on 86-volume subset
• Using IPython Notebook
Volume-level topic modeling on ‘anthropomorphism’ yields a set of topics
Of this set of topics, choose topic 16 as best
Volumes most similar to topic 16
Repeat topic modeling at page level
Topic model at page level for topics anthropomorphism, animal, and psychology
Pick top 3: topics 16, 10, 26
Show documents of topics 10, 16, 26
Drop to sentence level
• Select three books* with highest aggregate of 20-40 topic-relevant pages for more precise analysis
• Model the three books at the sentence level (uses machine learning)
* Start from 1315 texts, narrow to 86, then to the 3 most relevant
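The narrowing from pages to a handful of books could be sketched as a ranking over per-page topic weights. This toy example assumes a topic model has already produced a weight for the chosen topic on every page; the sample data, the threshold, and the function name `top_books` are hypothetical, and the talk does not specify how the aggregate was actually computed.

```python
from collections import defaultdict

def top_books(page_weights, threshold=0.5, k=3):
    """Rank books by summed topic weight over pages at or above threshold.

    page_weights: iterable of (book_id, page_number, weight) tuples.
    """
    totals = defaultdict(float)
    for book_id, _page, weight in page_weights:
        if weight >= threshold:  # keep only topic-rich pages
            totals[book_id] += weight
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [book_id for book_id, _total in ranked[:k]]

# Made-up page-level weights for a single chosen topic.
pages = [
    ("vol_a", 1, 0.9), ("vol_a", 2, 0.7),
    ("vol_b", 1, 0.6),
    ("vol_c", 1, 0.2), ("vol_c", 2, 0.8), ("vol_c", 3, 0.6),
    ("vol_d", 1, 0.1),
]
print(top_books(pages))
```

Books whose pages never reach the threshold, like `vol_d` here, drop out entirely, which mirrors how uninteresting volumes such as course catalogs would be filtered away.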
Promising early results …
Copyright: A Reality
Full text download is limited by both size and copyright
Computing with Copyrighted materials: HTRC Data Capsule
• Copyrighted materials can be computed on, but cannot be shared by humans for human (reading) consumption
• Needs a computational framework that enables computing but restricts human consumption
• A secure computing framework that:
  – Trusts that researcher will not deliberately leak data
  – Prevents malware acting on user's behalf from leaking data
• Supports openness: accepts user-contributed analysis
• Supports large scale and low cost: protections can be extended to utilization of public supercomputers
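The "computation moves to the data" idea behind the capsule can be illustrated with a toy, in-process sketch: a researcher-supplied analysis is run where the texts live, and only derived counts come back. The class `ToyCapsule`, its `run` method, and the crude release check are hypothetical illustrations; the actual Data Capsule relies on virtual machines and controlled I/O rather than anything shown here.

```python
from collections import Counter

class ToyCapsule:
    """Holds restricted texts and runs researcher-supplied analyses on them."""

    def __init__(self, texts):
        self._texts = dict(texts)  # full text is never returned directly

    def run(self, analysis):
        """Apply analysis to each text and release only integer summaries."""
        results = {}
        for volume_id, text in self._texts.items():
            outcome = analysis(text)
            # Crude release policy (assumption): only key -> integer count mappings leave.
            if not all(isinstance(v, int) for v in outcome.values()):
                raise ValueError("only integer counts may leave the capsule")
            results[volume_id] = dict(outcome)
        return results

def word_counts(text):
    """User-contributed analysis: lowercase word frequencies."""
    return Counter(text.lower().split())

capsule = ToyCapsule({"vol1": "the origin of species the", "vol2": "animal mind"})
summary = capsule.run(word_counts)
print(summary["vol1"]["the"])
```

An analysis that tries to return the text itself, for example `lambda text: {"full": text}`, is rejected by the check, which is the toy counterpart of restricting human reading while still permitting computation.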
HTRC Data Capsule Architectural Components
(Diagram labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Research results; Researcher)
HTRC Data Capsule interaction
(Diagram labels: Registry Services, worksets; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; SSH; Research results; Researcher)
Upon run, Secure Capsule controls I/O behind the scenes
• Researcher requests new VM of type X
• Image instance is created
• Researcher installs tools onto VM through window on her desktop
• Final location of results is registry
HTRC secure data capsule: view from researcher desktop
Thanks to our sponsors
2009: “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”
→ Paradigm: computation moves to the data (not vice versa)
2014: We are at massive scale of data, but data access is constrained. Can digital humanities researchers work within constraints? Will they find it worthwhile to do so?
Reality: Full text download is limited by size and copyright