bridging digital humanities research and big data repositories of digital text

Bridging Digital Humanities Research and Large Repositories of Digital Text

2nd Encuentro de Humanistas Digitales | 21.May.14 Biblioteca Vasconcelos, Mexico City

Beth Plale

Professor, School of Informatics and Computing
Director, Data To Insight Center

Indiana University

Tweet us - @HathiTrust #HTRC

HATHI TRUST RESEARCH CENTER

Setting Stage

•  “Informatics” is the application of computer and information science (CIS) to the data that constitutes the primary research material of a field.

•  In Europe, digital humanities is sometimes called “cultural informatics”, but that misses the point that an informatics researcher brings CIS methodologies to problems in the humanities, whereas DH researchers bring humanities methodologies to problems.

•  I am an informatics researcher (CIS methodologies) with a 15-year record in geo-informatics and, over the last 5 years, a growing understanding of the methodology and motivations of the digital humanities researcher

Digital humanities is an emerging discipline that applies computation to research in the humanities. More than simply conducting research with computers, digital humanities scholars use information technology as a central part of their methodology.

University of Illinois Library web site, 2014

Digital Humanities activities categorized

•  Access: big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc. Pretty much everything on that list is being digitized in very large numbers.

•  Production: we're already seeing more and more scholars producing their work for the Web. It might take the form of scholarly websites, blogs, wikis, or whatever. […] the entire production cycle uses technology (collecting, editing, discussing with others) before the final product is created.

•  Consumption: people get their materials in all kinds of new ways. Reading has changed with the Web. The way we read is changing. Bits and pieces of varied content from so many places and perspectives.

Interview with Brett Bobley, NEH, 2009 http://www.hastac.org/node/1934

Why does it matter?

“If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”

2009 interview with Brett Bobley, National Endowment for the Humanities, US, on predictions for the future of Digital Humanities

Bobley’s Prediction, cont.

In a world of big, massive scale, he asks:

•  “How might quantitative technology-based methodologies like data mining help you to better understand a giant corpus? Help you zero in on issues?”

•  “What if you are a historian and you now have access to every newspaper around the world?”

•  “How might searching and mining that kind of dataset radically change your results?”

Goal of Talk

Introduce technical and architectural big data developments around HathiTrust, and emerging examples of use,

… to facilitate discussion around whether Brett Bobley’s 2009 prediction of “scale. Big, massive, scale”, which is here today, can now deliver on advances for digital humanities

#HTRC @HathiTrust

HathiTrust

•  HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia

http://www.hathitrust.org/htrc

http://www.hathitrust.org

→ Distinguished from


Content of HathiTrust

•  Books and journals
– Plus pilots around images, audio, born-digital

•  Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
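The percentages for the digitization sources follow directly from the volume counts. A minimal check in Python, using only the three counts listed on the slide (the variable names are illustrative, not part of any HathiTrust tooling):

```python
# Volume counts per digitization source, as listed on the slide.
sources = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

total = sum(sources.values())

# Share of each source as a percentage of all volumes, to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in sources.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% of {total:,} volumes")
```

The three counts add up to 10,495,916 volumes, and the rounded shares match the figures quoted above.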


Content Sources


Content distribution

360,000 volumes in Spanish


Motivation for HTRC

→  HathiTrust repository is massive in scale: a latent goldmine for text-based research
→  Restricted nature of parts of HathiTrust content suggests need for new forms of access that preserve the intimate nature of interaction with texts while at the same time honoring restrictions on access
→  Size and restrictions demand a new paradigm: computation moves to the data (not vice versa)


HathiTrust Research Center

•  The HathiTrust Research Center (HTRC) was established in 2011 to enable computational research across a comprehensive body of published works, for the purposes of scholarship, education, and invention.

•  HTRC Executive Committee
–  Beth Plale, co-Director, Professor of Informatics and Computing, Indiana University
–  J. Stephen Downie, co-Director, Professor of Information Science, University of Illinois
–  Robert McDonald, Indiana University Libraries
–  Beth Namachchivaya Sandore, University of Illinois Library
–  John Unsworth, CIO, Dean of Library, Brandeis University

HTRC system

[Figure labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]

Return to categories of DH activity

HTRC in current form best at supporting:

•  Access: by narrowing down to essential materials quickly, separating wheat from chaff
“big part of what [digital humanities scholar] does is study cultural heritage materials - books, newspapers, paintings, film, sculptures, music, ancient tablets, buildings, etc.”

•  Production: by supporting computational investigation over massive scale of texts that will require large-scale computers (cloud computing)

•  Consumption: by tracking the bits and pieces (i.e., the HTRC workset)
“The way we read is changing. Bits and pieces of varied content from so many places and perspectives.”

Interview with Brett Bobley, NEH, 2009

Workset manages engagement with texts

EXAMPLES OF RESEARCH THAT IS POSSIBLE AT SCALE

•  Topic modeling
•  Author Gender Identification
•  Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts


Topic Modeling

•  Can answer more complex or nuanced questions
– What are the primary themes of an author?
– What are the primary themes of a research domain?
– When did a new topic enter a research domain?
•  Provides more data than word counts
– 100s of topics can be extracted.
– Underlying data (topics, volume, and page) is available
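To make "topics" concrete, here is a minimal sketch of one common topic model, latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, in plain Python. The slides do not say which model or toolkit HTRC uses, so the algorithm choice, the function name `lda_gibbs`, the hyperparameters, and the toy documents are all assumptions for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: tokens per (document, topic), per (topic, word), per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy "pages"; in practice each document could be one page of one volume.
docs = [
    "animal instinct mind animal habit".split(),
    "volume illustration edition volume print".split(),
    "mind animal instinct reason".split(),
    "edition print illustration volume".split(),
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, n_iter=100)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

The returned `doc_topic` table is the kind of underlying data the slide mentions: for every document (here a toy page) it records how many tokens were assigned to each topic, which is what allows a topic to be traced back to volumes and pages.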


Themes for Authors

Two topics with identical centralities (e.g., Dickens) but separate themes

More strongly focused on book (illustrations, volume, literature)

More strongly focused on author himself (letters, household, house)

Ted Underwood, Univ of Illinois

GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES

Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University

Talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013


Gender Identification of Text

•  Question Investigated: Can we use author names in bibliographic records to identify gender?
•  Looked at 2.6 million bibliographic records
–  Extracted personal author data
–  MARC 100 abcd and 700 abcd
•  606,437 unique personal author strings
•  Bibliographic data is not fielded like patent names
•  Relying on standard cataloging practice
–  Last name, first name middle name, titles/honorifics, dates


Authors vs Names

There is the author, then there are the names under which the author is published…
•  Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924
•  Methuem, Algernon
•  Methuen Algernon
•  Methuen Marshall, Sir, bart., 1856-
•  Methuen, A. Sir, 1856-1924
•  Methuen, A. Sir, bart., 1856-1924
•  Methuen Marshall, Sir bart 1856-1924
•  Methuen, Algernon Methuen Marshall, Sir, 1856-1924
•  Methuen, Algernon Methuen Marshall, Sir, bart., 1856-1924
•  Methuen, Algernon, 1856-1924
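Because the author strings are unfielded, any use of them starts by splitting a heading into its conventional parts (last name, forenames, titles/honorifics, dates). The following is a rough sketch of such a splitter, not the parser used in the study: the function name `parse_author`, the regular expression, and the very short `TITLES` set are assumptions made for illustration.

```python
import re

# A trailing life span such as "1856-1924" or an open-ended "1856-".
DATE_RE = re.compile(r"\d{4}-(?:\d{4})?\s*$")

# Hypothetical, deliberately tiny title list; a real one would be much longer.
TITLES = {"sir", "bart", "lady", "mrs", "miss", "baron", "count", "duke"}


def parse_author(raw):
    """Split a heading like 'Last, First Middle, titles, dates' into parts."""
    text = raw.strip()
    dates = None
    match = DATE_RE.search(text)
    if match:
        dates = match.group(0).strip()
        text = text[:match.start()]

    had_comma = "," in text
    titles, name_parts = [], []
    for part in text.split(","):
        kept = []
        for token in part.split():
            bare = token.rstrip(".")
            if bare.lower() in TITLES:
                titles.append(bare)
            else:
                kept.append(token)
        if kept:
            name_parts.append(kept)

    if not name_parts:
        surname, forenames = "", []
    elif had_comma:
        # Everything before the first comma is treated as the surname field.
        surname = " ".join(name_parts[0])
        forenames = [tok for part in name_parts[1:] for tok in part]
    else:
        # No comma at all: assume the first token is the surname.
        surname, forenames = name_parts[0][0], name_parts[0][1:]

    return {"surname": surname, "forenames": forenames,
            "titles": titles, "dates": dates}


full = parse_author("Methuen, Algernon Methuen Marshall, Sir bart., 1856-1924")
bare = parse_author("Methuen Algernon")
print(full)
print(bare)
```

Even with clean parsing, variants such as the misspelled "Methuem" or the comma-less "Methuen Algernon" still have to be reconciled to one author, which is why the distinction between authors and name strings matters for the counts that follow.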


Sources of Data

•  The Virtual International Authority File
– Hosted by OCLC
•  Harvested names from multiple data sources
– Census bureau
– Baby name sites
•  EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
– Developed an extensive list of European names
•  Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc.
– Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
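One way these sources could be combined is a simple lookup: check titles and honorifics first, then fall back to first-name lists. The sketch below is an assumption about the general approach, not the study's actual procedure. The title sets come from the slide, while the first-name sets are tiny hypothetical stand-ins for the VIAF, web, and patent name lists.

```python
# Titles and honorifics named on the slide.
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}

# Hypothetical toy name lists standing in for the harvested sources.
MALE_NAMES = {"algernon", "charles", "leslie"}
FEMALE_NAMES = {"jane", "mary", "leslie"}


def identify_gender(forenames, titles):
    """Return 'female', 'male', 'ambiguous', or 'unknown' for one name string."""
    votes = set()
    # Titles are treated as the stronger signal and checked first.
    for title in titles:
        key = title.rstrip(".").lower()
        if key in MALE_TITLES:
            votes.add("male")
        if key in FEMALE_TITLES:
            votes.add("female")
    # Fall back to the first forename only when no title decided the case.
    if not votes and forenames:
        first = forenames[0].rstrip(".").lower()
        if first in MALE_NAMES:
            votes.add("male")
        if first in FEMALE_NAMES:
            votes.add("female")
    if len(votes) == 2:
        return "ambiguous"
    if votes:
        return votes.pop()
    return "unknown"


print(identify_gender(["A."], ["Sir"]))   # title decides
print(identify_gender(["Leslie"], []))    # name appears in both lists
print(identify_gender(["X."], []))        # initials only, no title
```

The four return values mirror the four result categories (female, male, unknown, ambiguous); a bare initial with no title has nothing to match against and therefore ends up as unknown.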


Initial Gender Results

•  Approximately 80% of name strings have initial gender identification
–  Female: 59,365 (10%)
–  Male: 425,994 (70%)
–  Unknown: 114,204 (19%)
–  Ambiguous: 5,965 (less than 1%)


Results by Data Source

Against the whole set of name strings

•  VIAF
– 19% hit rate
•  Web Names
– 54% hit rate
•  Patents Names
– 8%

Colin Allen, Jamie Murdock
Cognitive Science, Indiana University

Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013

Digging into philosophy of science

•  Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts

•  Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic

•  Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures

The How

•  1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’

•  Set contains lots of uninteresting books: e.g., college course catalogs

•  Apply topic modeling on 86 volume subset
•  Using IPython Notebook

Volume level topic modeling on ‘anthropomorphism’ yields set of topics

… Of set of topics, choose ‘16’ as best

Volumes most similar to topic 16

Repeat topic modeling at page level

Topic model at page level for topics anthropomorphism, animal, and psychology

Pick top 3: topics 16, 10, 26

Show documents of topics 10, 16, 26

Drop to sentence level

•  Select three books* with highest aggregate of 20-40 topic-relevant pages for more precise analysis

•  Model the three books at the sentence level (uses machine learning)

* Start from 1315 texts, down to 86, then down to most relevant 3
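The narrowing from pages to a handful of books can be sketched as a ranking step: keep the pages whose weight for the chosen topics clears a cutoff, aggregate per volume, and take the top few. The sketch below shows the idea under stated assumptions. The function name `top_volumes`, the `threshold` value, and the toy weights are hypothetical, and the slides do not give the actual scoring rule.

```python
from collections import defaultdict


def top_volumes(page_weights, threshold=0.5, k=3):
    """Rank volumes by the summed weight of their topic-relevant pages."""
    score = defaultdict(float)
    pages = defaultdict(int)
    for (volume, _page), weight in page_weights.items():
        # Only pages that are "rich" in the chosen topics contribute.
        if weight >= threshold:
            score[volume] += weight
            pages[volume] += 1
    # Highest aggregate first; ties broken by volume id for determinism.
    ranked = sorted(score, key=lambda v: (-score[v], v))
    return [(v, pages[v], round(score[v], 3)) for v in ranked[:k]]


# Toy page-level topic weights keyed by (volume id, page number).
page_weights = {
    ("A", 1): 0.9, ("A", 2): 0.8,
    ("B", 1): 0.7, ("B", 2): 0.2,
    ("C", 1): 0.6,
    ("D", 1): 0.1,
}
ranked = top_volumes(page_weights, threshold=0.5, k=3)
print(ranked)
```

Volume "D" never clears the cutoff and drops out, which is the same filtering effect that removes uninteresting books such as course catalogs from the candidate set.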

Promising early results …

Copyright: A Reality

Full text download is limited by both size and copyright


Computing with Copyrighted materials: HTRC Data Capsule

•  Copyrighted materials can be computed on, but cannot be shared by humans for human (reading) consumption

•  Needs a computational framework that enables computing while restricting human consumption

•  A secure computing framework that:
–  Trusts that researcher will not deliberately leak data
–  Prevents malware acting on user's behalf from leaking data

•  Supports openness: accepts user-contributed analysis
•  Supports large scale and low cost: protections can be extended to utilization of public supercomputers

[Figure labels: VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Secure Capsule cluster; SSH; Research results; Researcher]

HTRC Data Capsule Architectural Components

[Figure labels: Registry Services, worksets; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Upon run, Secure Capsule controls I/O behind scenes; SSH; Research results; Researcher]

HTRC Data Capsule interaction

[Figure labels, with steps numbered 1) to 4) in the diagram: Researcher requests new VM of type X; Image instance is created; Researcher installs tools onto VM through window on her desktop; Final location of results is registry; Registry Services, worksets]

HTRC secure data capsule: view from researcher desktop

Thanks to our sponsors

2009: “If I had to predict some interesting things for the future in the area of access, I'd sum it up in one word: scale. Big, massive, scale. That's what digitization brings - access to far, far more cultural heritage materials than you could ever access before.”

→ Paradigm: computation moves to the data (not vice versa)

2014: We are at massive scale of data, but data access is constrained. Can digital humanities researchers work within constraints? Will they find it worthwhile to do so?

Reality: Full text download is limited by size and copyright
