2012 eHumanities Amsterdam - Descartes text conversion: Lessons learned

Letters from Descartes in digital format: An exercise in conversion. Dirk Roorda @ eHumanities, 2012-01-26

Upload: dirk-roorda

Post on 26-Jun-2015

DESCRIPTION

The arduous process of producing a digital text of Descartes' letters, including mathematical formulas. It was a subtask of the CKCC project at the Huygens Institute. Lessons learned. With Erik-Jan Bos, Utrecht.

TRANSCRIPT

Page 1: 2012 eHumanities Amsterdam - Descartes text conversion: Lessons learned

Letters from Descartes in digital format

An exercise in conversion

Dirk Roorda @ eHumanities 2012-01-26

Page 2

Overview

- the task
- the method
- the lessons
- the result
  ◦ demo

Page 3

The task: converting from ... JapAM

Descartes Correspondence

ca. 700 letters

69,237 lines

600 formulas

4.2 MB (without the 311 pictures)

Page 4
Page 5

The task: converting to ... CKCC corpus Descartes

XML: Text Encoding Initiative (TEI)

~ 35,000 elements, of which

- 7,200 metadata
- 7,700 paragraphs
- 6,200 formulas
- 6,000 text-formattings
- 4,200 structure
- 2,900 page-breaks
- 538 images

Page 6
Page 7

The (re)Sources

EJB

Metadata

Google Books

EJB's head

Page 9

The method

- observation
- non-algorithmic changes
- consolidation
- proofs

Page 10

Observation

use digital equipment:

- your text-editor
- your scripting language
- your regular expressions

Page 12

observation: italic scopes

replace =(.*?)$ by <italic>match1</italic>

???

Aargh!#@\€]
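The slide gives only the pattern and the replacement. As a rough sketch in Python (the original tool was a Perl script), assuming that `=` opens an italic scope running to the end of the line, the naive substitution could look like this. The sample strings are invented for illustration and are not taken from the source files.

```python
import re

# Assumption: "=" opens an italic scope that runs to the end of the line.
ITALIC = re.compile(r"=(.*?)$", re.MULTILINE)

def mark_italics(text: str) -> str:
    """Wrap everything after '=' up to the end of the line in <italic> tags."""
    return ITALIC.sub(r"<italic>\1</italic>", text)

# Invented sample: an italic scope that really does end at the line end.
print(mark_italics("Je vous envoie =la Dioptrique\net le reste."))

# Invented sample: an equals sign inside a formula is wrongly treated as italics.
print(mark_italics("x = 2"))
```

The second call hints at why a single pattern is not enough: any `=` that is not an italic marker gets wrapped as well, so the real scopes have to be established by observation first.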

Page 13

observation: greek

Page 14
Page 15

non-algorithmic changes

Page 16

closers: hints

Page 17

consolidating: metadata

[Diagram: conversion process. Inputs: ... formulas, meta, closers ... Labels: canonical; initial; corrected; improved; checked metadata combining]

Page 18

merging meta

Page 19

proofs: formulas

Page 20

proofs: formulas in gif

Page 21

quick formula checking

Page 22

The anatomy of conversion

convert.pl

100 KB of program code text = 25 densely typed pages = 3427 lines

of which

2175 real code lines

Code/Input = 1/32
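The ratio can be checked against figures given earlier in the deck, assuming that "code" means the 2175 real code lines and "input" the 69,237 source lines:

```latex
\[
\frac{\text{code}}{\text{input}}
  = \frac{2175}{69\,237}
  \approx \frac{1}{31.8}
  \approx \frac{1}{32}
\]
```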

Page 23
Page 24

Statistics

1/3 of the tasks need 2/3 of the code

- formulas: (2) 37 %
- headers, openers, closers: (3) 16 %
- meta and images: (3) 11 %

run time of same tasks

- formulas: (2) 29 %
- headers, openers, closers: (3) 6 %
- meta and images: (3) 10 %

total run time: (25) 40 sec
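The headline claim can be verified from the listed figures, assuming that the numbers in parentheses are subtask counts and that 25 is the total number of subtasks:

```latex
\[
\frac{2 + 3 + 3}{25} = \frac{8}{25} = 0.32 \approx \tfrac{1}{3},
\qquad
37\,\% + 16\,\% + 11\,\% = 64\,\% \approx \tfrac{2}{3}
\]
\[
29\,\% + 6\,\% + 10\,\% = 45\,\%,
\qquad
0.45 \times 40\ \text{s} = 18\ \text{s}
\]
```

Under the same assumption, these eight subtasks account for about 18 of the 40 seconds of run time.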

Page 25

The tricks of conversion

1. Unicode is your friend
2. Split into many subtasks
3. task = configuration + workflow
4. Count and check
5. Performance matters
6. Do not give up automation

Page 26

1. Unicode is your friend

Page 27

2. Split into many subtasks

(2a) that can be run separately

(2b) that can be reordered easily
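A minimal sketch of what such a split might look like, written in Python rather than the Perl of the original script. The two subtasks, the `[p. 12]` page-break convention and the `run` helper are hypothetical and serve only to show that each step can be run on its own and that the order is a plain list.

```python
import re
from typing import Callable, Dict, List

Subtask = Callable[[str], str]

def strip_trailing_space(text: str) -> str:
    """Remove spaces and tabs at the end of every line."""
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)

def mark_page_breaks(text: str) -> str:
    """Hypothetical convention: a line like '[p. 12]' marks a page break."""
    return re.sub(r"^\[p\. (\d+)\]$", r'<pb n="\1"/>', text, flags=re.MULTILINE)

# Registry of subtasks by name, so any subset can be selected or reordered.
SUBTASKS: Dict[str, Subtask] = {
    "strip_trailing_space": strip_trailing_space,
    "mark_page_breaks": mark_page_breaks,
}

def run(text: str, order: List[str]) -> str:
    """Apply the named subtasks in the given order."""
    for name in order:
        text = SUBTASKS[name](text)
    return text

sample = "[p. 12]  \nMonsieur,  \n"
print(run(sample, ["strip_trailing_space", "mark_page_breaks"]))
print(run(sample, ["mark_page_breaks"]))  # trailing spaces block the match
```

Running `mark_page_breaks` alone leaves the sample unchanged, which illustrates why being able to reorder subtasks easily matters.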

Page 28

3. task = config + workflow
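One way to read this slogan, sketched in Python under stated assumptions: the workflow is a single generic function, and each task is only a small block of configuration fed into it. `TaskConfig`, `run_task` and the `{alpha}` notation for Greek letters are hypothetical names and conventions, not taken from the original script.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TaskConfig:
    """Configuration of one task: a name and a list of (pattern, replacement) rules."""
    name: str
    rules: List[Tuple[str, str]]

def run_task(config: TaskConfig, text: str) -> Tuple[str, int]:
    """Generic workflow: apply every rule and report how many replacements were made."""
    total = 0
    for pattern, replacement in config.rules:
        text, n = re.subn(pattern, replacement, text, flags=re.MULTILINE)
        total += n
    return text, total

# Hypothetical configuration: Greek letters written as {alpha}, {beta} in the source.
greek = TaskConfig(name="greek", rules=[(r"\{alpha\}", "α"), (r"\{beta\}", "β")])

converted, count = run_task(greek, "{alpha} et {beta}")
print(converted, count)
```

Adding a new task then means writing a new `TaskConfig`, not a new loop.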

Page 29

4. Count and check (ad nauseam)
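A small illustration of counting and checking, assuming the output is well-formed XML without namespaces. The sample fragment and the expected counts are invented; `count_elements` and `check_counts` are hypothetical helpers, not part of the original program.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, Tuple

def count_elements(xml_text: str) -> Counter:
    """Count how often each element name occurs, including the root."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())

def check_counts(actual: Counter, expected: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {tag: (expected, actual)} for every tag whose count differs."""
    return {
        tag: (n, actual.get(tag, 0))
        for tag, n in expected.items()
        if actual.get(tag, 0) != n
    }

# Invented sample output and invented expectations.
sample = "<text><p>Monsieur,<pb n='1'/></p><p><formula>x+y</formula></p></text>"
counts = count_elements(sample)
mismatches = check_counts(counts, {"p": 2, "formula": 1, "pb": 2})
print(mismatches)
```

An empty result means every expected count was met; anything else points at the subtask to inspect.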

Page 30

5. Performance matters!

- was 30+ seconds
- is now 2.07 seconds
- many new subtasks based on same template (gain = 15 * 30 sec = 7.5 min per run)
- many, many runs before everything is OK (gain = 100 * 7.5 min = 12.5 hours CPU-time)
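The arithmetic on this slide works out as follows, assuming that 15 is the number of subtasks built on the faster template, 100 the number of runs, and 30 seconds a rounded saving per subtask (the measured drop is from 30+ to 2.07 seconds):

```latex
\[
15 \times 30\ \text{s} = 450\ \text{s} = 7.5\ \text{min per run}
\]
\[
100 \times 7.5\ \text{min} = 750\ \text{min} = 12.5\ \text{h of CPU time}
\]
```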

Page 31

6. Do not give up automation

we used a lot of expert knowledge, which has all been transferred to

- the source
- consolidated extra inputs

so the conversion is still repeatable and modifiable

[Diagram: conversion program. Inputs: source (corrections), formulas (hints), meta (hints), closers (hints). Output: results (CKCC)]

Thank You