2010 digital humanities london - dutch republic of letters


Upload: dirk-roorda

Post on 17-May-2015


DESCRIPTION

Digital Humanities 2010 London. About the CKCC project: Dutch Republic of Letters. With Charles van den Heuvel.

TRANSCRIPT

Page 1: 2010 Digital Humanities London - Dutch Republic of Letters

scholarly communication @ 1650

scholarly communication @ 2050

Letters, Ideas and Information Technology

Erik-Jan Bos, Univ. Utrecht, [email protected]

Charles van den Heuvel, VKS, [email protected]

Dirk Roorda (that’s me), DANS, [email protected]

Using digital corpora of letters to disclose the circulation of knowledge in the 17th century

Page 2: 2010 Digital Humanities London - Dutch Republic of Letters

http://ckcc.huygens.knaw.nl/

Page 3: 2010 Digital Humanities London - Dutch Republic of Letters

Nota

Beeckman

Cats STEVIN

Huygens STEVIN

Langeren

relation disciplines

direct - water

indirect - literature

Page 4: 2010 Digital Humanities London - Dutch Republic of Letters

Corpora of 17th century scholars

• Constantijn Huygens
• Christiaan Huygens
• Grotius
• Descartes
• Swammerdam
• Leeuwenhoek
• Barlaeus
• Spinoza
• and more?

Page 5: 2010 Digital Humanities London - Dutch Republic of Letters

Corpus               Number of letters  In possession?  Format        Metadata                 Normalized?
Grotius              7946               Yes             TEI           In Interp element        Yes, DBNL codes
Van Leeuwenhoek      337                Yes             TEI           In Interp element        Yes, DBNL codes
Descartes            750                Yes             XML (no TEI)  other markup             No, plain text
Barlaeus             1200               300 ready       Word          unknown                  unknown
Swammerdam           80                 Yes             Word          unknown                  unknown
Constantijn Huygens  7295               Yes             XML           Probably Interp element  DBNL codes
Christiaan Huygens   2900?              Mid-2010        probably TEI  Probably Interp element  DBNL codes

Page 6: 2010 Digital Humanities London - Dutch Republic of Letters

CEN - Metadata

Catalogus Epistularum Neerlandicarum
• 265,000 descriptions of approximately 1,000,000 letters
• from 1600 – now
• of which 100,000 letters in the 17th century

Page 7: 2010 Digital Humanities London - Dutch Republic of Letters

Research Questions

• History of science:
  • How did knowledge circulate in the 17th-century Dutch Republic?

• Patterns in knowledge growth:
  • How can we visualise sets of letters that exhibit features of knowledge circulation?

• Re-use:
  • How can we expose the sources, annotations, and resulting patterns to further research?

Page 8: 2010 Digital Humanities London - Dutch Republic of Letters

Challenge

East is east and West is west and ...

East: Traditional scholarship
• interpretation
• close reading
• solving puzzles

West: Computational methods
• dealing with patterns
• gleaned from large quantities of texts
• by automatic tools

Page 9: 2010 Digital Humanities London - Dutch Republic of Letters

Issues to deal with

• making the sources uniformly available
  • well coded in TEI, access rights

• overcoming the language barrier
  • (17th-century varieties of French, Latin, Dutch)

• named entity recognition & concepts
  • people, places, dates, concepts, instruments
  • mixture of interpretation and algorithms

• creating useful visualisations
  • aiding exploration by historians of science

Page 10: 2010 Digital Humanities London - Dutch Republic of Letters

ICT in Humanities Research

• collaboratory
  • e-Laborate as starting point

• algorithmic pipelines
  • from source material to visualisation

• infrastructure
  • archiving results
  • re-using data
  • developing new algorithms
  • disseminating the methodology

Page 11: 2010 Digital Humanities London - Dutch Republic of Letters

collaboratory

Page 12: 2010 Digital Humanities London - Dutch Republic of Letters

pipelines

Page 13: 2010 Digital Humanities London - Dutch Republic of Letters

pipelines (current)

• language detection, using
  Language Identification from Text Using N-gram Based Cumulative Frequency Addition,
  Bashir Ahmed, Sung-Hyuk Cha, and Charles Tappert, 2004

• results
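
To make the language detection step concrete, here is a minimal sketch in the spirit of n-gram cumulative frequency addition: each language gets a table of relative character n-gram frequencies, and a letter is assigned to the language whose table yields the highest summed frequency. This is a simplification for illustration, not the project's implementation and not a faithful reproduction of the cited paper. The function names, the trigram size, and the three one-line training samples are all assumptions made up for this example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lower-cased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_profiles(samples, n=3):
    """Build one relative-frequency table of n-grams per language."""
    profiles = {}
    for lang, text in samples.items():
        counts = Counter(char_ngrams(text, n))
        total = sum(counts.values())
        profiles[lang] = {gram: count / total for gram, count in counts.items()}
    return profiles


def detect_language(text, profiles, n=3):
    """Sum the stored frequency of every n-gram in the text, per language; highest sum wins."""
    grams = char_ngrams(text, n)
    scores = {
        lang: sum(table.get(gram, 0.0) for gram in grams)
        for lang, table in profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data; a real run would train on whole letters per language.
TOY_SAMPLES = {
    "dutch": "ik heb uw brief ontvangen en zal u spoedig antwoorden mijn heer",
    "french": "monsieur j'ai reçu votre lettre et je vous répondrai bientôt",
    "latin": "litteras tuas accepi et tibi quam primum respondebo vir clarissime",
}

PROFILES = train_profiles(TOY_SAMPLES)
print(detect_language("je vous ai écrit une lettre", PROFILES))
```

With realistic amounts of training text per language the same scoring works unchanged; only the profiles grow.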

Page 14: 2010 Digital Humanities London - Dutch Republic of Letters

pipelines (current)

• spelling normalisation
  • VARD (http://www.comp.lancs.ac.uk/~barona/vard2/)
  • with help from (http://www.dicollecte.org/home.php?prj=fr)

• results
  • French: VARD works (after improvements), although designed for historical English
  • Dutch: still on the lookout for a combination of resources, tools, and dexterity
  • Latin: later
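
As a rough illustration of what spelling normalisation involves, the sketch below rewrites a historical spelling with a few letter-replacement rules and accepts a rewritten form only if it occurs in a modern word list. It is not VARD and does not use VARD's API; the rules, the lexicon, and the function name are hypothetical toy data chosen for early modern French.

```python
def normalise(word, lexicon, rules):
    """Map a historical spelling to a modern form found in the lexicon, if any.

    Every rule is applied optionally, so all combinations of rewrites are tried.
    If no rewritten form is in the lexicon, the word is returned unchanged.
    """
    word = word.lower()
    if word in lexicon:
        return word
    candidates = {word}
    for old, new in rules:
        # Keep the existing candidates and add their rewritten variants.
        candidates |= {candidate.replace(old, new) for candidate in candidates}
    matches = sorted(c for c in candidates if c in lexicon)
    return matches[0] if matches else word


# Hypothetical replacement rules and modern lexicon, for illustration only.
TOY_RULES = [("es", "é"), ("oi", "ai"), ("y", "i"), ("ost", "ot")]
TOY_LEXICON = {"était", "écrire", "votre", "lettre", "roi"}

for historical in ["estoit", "escrire", "vostre", "roy"]:
    print(historical, "->", normalise(historical, TOY_LEXICON, TOY_RULES))
```

Checking candidates against a lexicon keeps the rules from over-applying: "roi" stays "roi" because it is already a modern word, even though the rule "oi" to "ai" would match it.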

Page 15: 2010 Digital Humanities London - Dutch Republic of Letters

pipelines (current)

Page 16: 2010 Digital Humanities London - Dutch Republic of Letters

pipelines (current)

• named entity recognition
  • known tools get 70%
  • search for optimal tools in the next stage

Page 17: 2010 Digital Humanities London - Dutch Republic of Letters

pipelines (insights)

• expect the most from statistical methods

• language technology may boost results

• it remains to be seen by how much

Page 18: 2010 Digital Humanities London - Dutch Republic of Letters

Topic-Author-Time

Source: Scott Weingart UIA

Page 19: 2010 Digital Humanities London - Dutch Republic of Letters

infrastructure

Page 20: 2010 Digital Humanities London - Dutch Republic of Letters

the project’s legacy

• more than publications
  • curated sources, annotations, visualisations

• more than algorithms
  • a framework for analysis of historical texts

• more than a piece of historical research
  • data and (intermediate) results worthwhile to linguists, computer scientists, sociologists

• more than a passive dataset
  • extensible, dynamic, interactive

Page 21: 2010 Digital Humanities London - Dutch Republic of Letters

preserving the results

• part of the CLARIN infrastructure
  • http://www.clarin.eu/
  • http://www.clarin.nl/

• materials in a Trusted Digital Repository (DANS)
  • http://easy.dans.knaw.nl/dms

Page 22: 2010 Digital Humanities London - Dutch Republic of Letters

working with CLARIN

• CLARIN-EU
  • Outreach to humanities: use cases
  • CKCC one of 10 selected projects
  • received expert input for choice of language tools

• CLARIN-NL
  • CKCC one of 10 initial projects in the Dutch national construction effort
  • support for applying language technology

Page 23: 2010 Digital Humanities London - Dutch Republic of Letters

Adapting to CLARIN

• Conforming to standards

• CLARIN standards are in evolution
  • (and will remain evolvable)

• Common MetaData Infrastructure
  • a registry of metadata components
  • defined by the community
  • with explicit semantics (http://www.isocat.org/)

• Data in TEI (as export/import format)

Page 24: 2010 Digital Humanities London - Dutch Republic of Letters

Trusted Digital Repository

• materials
  • reliable (provenance metadata)
  • findable (CMDI metadata)
  • referable (persistent identifiers)
  • accessible (viewable in web browser)
  • usable (downloadable)

• sooner or later:
  • high-performance computing
  • Memento: a time-sensitive web interface to the dynamic contents of the collaboratory (http://arxiv.org/abs/0911.1112)
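
A time-sensitive interface of the Memento kind lets a client ask for a resource as it existed at a given moment by sending the desired date in a request header. The sketch below only builds such a header and sends nothing over the network. The header name `Accept-Datetime` follows the later Memento specification (RFC 7089), and the early proposal linked above may differ in detail; the function name is an assumption for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(moment):
    """Build request headers asking for a resource as it was at `moment`.

    `moment` must be a timezone-aware datetime; it is converted to UTC and
    rendered as an HTTP date.
    """
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


# Example: ask for the state of a page at noon UTC on 1 July 2010.
headers = memento_headers(datetime(2010, 7, 1, 12, 0, tzinfo=timezone.utc))
print(headers)
```

Such headers could be passed to any HTTP client when requesting a page from a server that supports datetime negotiation.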

Page 25: 2010 Digital Humanities London - Dutch Republic of Letters

http://www.clarin.eu/node/3073

Page 26: 2010 Digital Humanities London - Dutch Republic of Letters

http://ckcc.huygens.knaw.nl/