eDictor: (a chronology)

Upload: maria-clara-paixao-de-sousa

Post on 03-Dec-2014

Page 2: 2013 e dictor_a_chronology

eDictor: (a chronology)

Roundtable: e-dictor, Advances and Perspectives.

Workshop: Construction and use of large annotated corpora

Campinas, Sept. 9, 2013.

Page 4: 2013 e dictor_a_chronology

2004-2006: Preliminary Ideas

Page 5: 2013 e dictor_a_chronology

The preliminary ideas that would result in the development of eDictor in 2007 started in 2004 with a project that aimed at restructuring the text-preparation system at the Tycho Brahe Corpus.


Page 7: 2013 e dictor_a_chronology

Essentially, the idea was that the Corpus would consist of single-source documents that could contain all relevant annotations (textual, philological, linguistic).

Page 8: 2013 e dictor_a_chronology

This was achieved in partnership with computer scientist Thorsten Trippel, from the University of Bielefeld.

He suggested that we use XML to re-encode the Corpus, and XSLT to transform each document into different presentations of the encoded information.

Page 9: 2013 e dictor_a_chronology

Our central idea was to encapsulate editorial interventions at the word level, i.e. for each token in the corpus, pairing the original form with its edited form, so that each element of the pair would be available to different modules of analysis.
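To make the word-level pairing concrete, here is a minimal sketch in Python. The slides do not show the actual Tycho Brahe schema, so the element names (`w`, `o`, `e`), the `type` attribute and the sample words are all assumptions made up for illustration. The sketch only shows how each token can carry both forms, and how either form can be read off independently.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup (NOT the real Tycho Brahe schema): each <w> token holds
# the original form in <o> and, when the editor intervened, an edited form in <e>.
SAMPLE = """
<s id="s1">
  <w><o>hũa</o><e type="modernization">uma</e></w>
  <w><o>breve</o></w>
  <w><o>historia</o><e type="modernization">história</e></w>
</s>
""".strip()


def token_pairs(xml_text):
    """Return one (original, edited) tuple per token; edited is None if absent."""
    root = ET.fromstring(xml_text)
    pairs = []
    for w in root.iter("w"):
        original = w.findtext("o")
        edited = w.findtext("e")  # None when the token has no edition
        pairs.append((original, edited))
    return pairs


def original_view(pairs):
    """Join the original forms only."""
    return " ".join(o for o, _ in pairs)


def edited_view(pairs):
    """Join the edited forms, falling back to the original when unedited."""
    return " ".join(e if e is not None else o for o, e in pairs)


pairs = token_pairs(SAMPLE)
print(original_view(pairs))
print(edited_view(pairs))
```

Because both members of the pair stay in the same document, a module that needs the original spelling and a module that needs the edited spelling can work from one source file.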


Page 10: 2013 e dictor_a_chronology

This first idea was applied to a few pilot texts and presented as a poster at the annual conference of the ALLC in 2004:

PAIXÃO DE SOUSA, M. C.; TRIPPEL, T. Single source processing of Historic corpora for diverse uses. In: Proceedings of the Association for Literary and Linguistic Computing (ALLC) Annual Conference, 2004.

Page 11: 2013 e dictor_a_chronology

In 2005, the Corpus went through a complete re-encoding process.


Page 12: 2013 e dictor_a_chronology

The restructured Corpus was composed of XML documents that, via XSLT transformations, would render different (HTML and TXT) versions, adequate for different visualization and processing needs, as we had originally planned.
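The single-source principle can be sketched as one XML document feeding several renderers. The example below is an illustration only: it uses plain Python functions in place of the XSLT stylesheets the project actually relied on, and the element names (`text`, `s`, `w`) are assumptions. The sentence identifiers and words are borrowed from the plain-text excerpt of the Corpus shown in this presentation.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source document; element names are illustrative only.
SOURCE = """
<text>
  <s id="g_008_s_43"><w>Neste</w><w>pequeno</w><w>serviço</w></s>
  <s id="g_008_s_44"><w>E</w><w>isto</w><w>assim</w></s>
</text>
""".strip()


def render_txt(root):
    """Plain-text version: one line per sentence, prefixed by its identifier."""
    lines = []
    for s in root.iter("s"):
        words = " ".join(w.text for w in s.iter("w"))
        lines.append(f"[{s.get('id')}] {words}")
    return "\n".join(lines)


def render_html(root):
    """HTML version: one paragraph per sentence, identifier kept as an id."""
    items = []
    for s in root.iter("s"):
        words = " ".join(escape(w.text) for w in s.iter("w"))
        items.append(f'<p id="{escape(s.get("id"))}">{words}</p>')
    return "<html><body>" + "".join(items) + "</body></html>"


root = ET.fromstring(SOURCE)
print(render_txt(root))
print(render_html(root))
```

Both outputs are derived from the same source, so a correction made once in the XML shows up in every version the next time the renderers run.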


Page 13: 2013 e dictor_a_chronology

The Tycho Brahe Corpus, restructured (XML base)

Page 14: 2013 e dictor_a_chronology

The Tycho Brahe Corpus, restructured (“catalogue” view)

Page 15: 2013 e dictor_a_chronology

The Tycho Brahe Corpus, restructured (“original” view)

Page 16: 2013 e dictor_a_chronology

The Tycho Brahe Corpus, restructured (“modernized” view)

Page 17: 2013 e dictor_a_chronology

The Tycho Brahe Corpus, restructured (simple text for further processing)

[ prologue (author: P.M. Gandavo)] [ title: AO MUITO ILUSTRE SENHOR DOM LIONIS PEREIRA, Epístola de Pero de Magalhães. ][g_008_s_43] Neste pequeno serviço (muito ilustre senhor ) que ofereço a Vossa Mercê das primícias de meu fraco entendimento, poderá em alguma maneira conhecer os desejos que tenho de pagar com minha possibilidade alguma parte do muito que se deve à ínclita fama de vosso heróico nome. [g_008_s_44] E isto assim pelo merecimento do nobilíssimo sangue e clara progênie de onde traz sua origem, como pelos troféus das grandes vitórias , e casos bem afortunados que lhe hão sucedido nessas partes do Oriente em que Deus o quis favorecer com tão larga mão, que não cuido ser toda minha vida bastante para satisfazer à menor parte de seus louvores . [g_008_s_45] E como todas estas razões me ponham em tanta obrigação , e eu entenda que outra nenhuma coisa deve ser mais aceita a pessoas de altos ânimos que a lição das escrituras , por cujos meios se alcançam os segredos de todas as ciências , e os homens vêm a ilustrar seus nomes e perpetuar os na terra com fama imortal , determinei escolher a Vossa Mercê entre os mais senhores da terra , e dedicar lhe esta breve história . [g_008_s_46] A qual espero que folgue de ver com atenção e receber me a benignamente debaixo de seu amparo : assim por ser coisa nova , e eu a escrever como testemunha de vista : como por saber quão particular afeição Vossa Mercê tem às coisas do engenho , e que por esta causa lhe não será menos aceito o exercício das escrituras , que o das armas. [g_008_s_47] Por onde com muita razão favorecido desta confiança possa seguramente sair a luz com esta pequena empresa e divulgar a pela terra sem nenhum receio , tendo por defensor dela a Vossa Mercê Cuja muito ilustre pessoa nosso Senhor guarde e acrescente sua vida e estado por longos e felizes anos . [ end prologue ]

Page 18: 2013 e dictor_a_chronology

Along with the application of the new single-source system to the Corpus, new ideas started to pop up.

Some of them were carried forward, some were not.

Page 19: 2013 e dictor_a_chronology

The main thing that we wanted to do back then and still have not done is...

... to integrate syntactic annotation into this same, single-source system...


Page 20: 2013 e dictor_a_chronology

Other ideas were a little more fruitful: the integration of other, less complex levels of linguistic annotation (such as items of lexicological interest); and the expansion of the system to include the possibility of critical editions, in which more than one version of the same text could be compared.


Page 21: 2013 e dictor_a_chronology

PAIXÃO DE SOUSA, M. C. A Anotação da variação de grafia no Corpus Histórico do Português Tycho Brahe: Frentes abertas para estudos do léxico. V Encontro de Corpora: Lingüística de Corpus: a aplicabilidade nos estudos sobre Léxico, São Carlos, 2005.

Page 22: 2013 e dictor_a_chronology

PAIXÃO DE SOUSA, M. C. Memórias do Texto. Mesa-redonda “Bibliotecas e bancos de dados digitais de literatura”, II Simpósio Nacional de Literatura e Informática, Florianópolis, 2005.

Published in 2006 as:

PAIXÃO DE SOUSA, M. C. Memórias do Texto. Texto Digital (UERJ), v. 1, p. 10, 2006.

Page 23: 2013 e dictor_a_chronology

PAIXÃO DE SOUSA, M. C. Critical Hipereditions and the new challenges for text-critique. Seminário Internacional Literaturas: Del texto al hipertexto. Madri, Universidade Complutense, setembro de 2006.

Published in 2007 as:

PAIXÃO DE SOUSA, M. C. Digital Text: Conceptual and methodological frontiers. In: Dolores Romero; Amelia Sanz. (Org.). Literatures in the Digital Era: Theory and Praxis. Cambridge: Cambridge Scholarly, 2007.

Page 24: 2013 e dictor_a_chronology

By 2006 the single-source encoding system was mature; a first manual was prepared and a more complete paper on these results was published.


Page 26: 2013 e dictor_a_chronology

TRIPPEL, T.; PAIXÃO DE SOUSA, M. C. Metadata and XML standards at work: a corpus repository of Historical Portuguese texts. V International Conference on Language Resources and Evaluation (LREC), 2006.

Page 28: 2013 e dictor_a_chronology

Meanwhile...

... as the system was presented to a wider range of potential users outside Tycho Brahe, new challenges emerged.


Page 29: 2013 e dictor_a_chronology

I Oficina de Anotação (1st Annotation Workshop) – Projeto CorPorA. Salvador, April 19-21, 2006.

Page 30: 2013 e dictor_a_chronology

The 1st annotation workshop outside the Tycho Brahe team, in 2006 in Salvador, was an important breakthrough.

It was then that we noticed that the original techniques used to annotate the XML documents (“by hand”, in Emacs) and to transform them (by coding XSL into the system via Saxon) were not adequate for teams with a less computational and more philological background.

Page 32: 2013 e dictor_a_chronology

After the workshop in 2006 it became clear that, if we wanted more teams to use the single-source annotation system, we would have to build software that could perform the annotation and transformation tasks through a user-friendly interface.

In other words... it was then that the idea of eDictor took shape.

Page 34: 2013 e dictor_a_chronology

2007: eDictor is launched!

Page 35: 2013 e dictor_a_chronology

eDictor beta 1.0 was developed in 2007 by Prof. Fabio N. Kepler (then a post-graduate student at IME-USP’s computer science program), and was first presented in the same year at the VI Encontro de Linguística de Corpus, at USP.


Page 37: 2013 e dictor_a_chronology

This first version of eDictor contained the core functions of the original text-encoding system: an XML annotation module and the option of exporting through XSLT transformations.

Plus... it included a morphosyntactic tagging function!

Page 39: 2013 e dictor_a_chronology

Interface of eDictor 1.0 beta 01

Page 41: 2013 e dictor_a_chronology

2008-2012: years of growing into new uses

Page 42: 2013 e dictor_a_chronology

Two important aspects mark the years 2008 to 2012 for the development of eDictor.

The first was the arrival of a new team member, Pablo P. F. Faria, who joined F. Kepler in developing the software after the first version.

Page 43: 2013 e dictor_a_chronology

The second important aspect was that, while up to 2008 the main application of the single-source system (first manually and later with eDictor) was the restructuring of the Tycho Brahe Corpus, after 2008 the system started to be used beyond Tycho Brahe.

This was important because, as the different projects had different aims, the tool started to include new technical features.

Page 45: 2013 e dictor_a_chronology

For instance, in 2009 eDictor started to be used by the Brasiliana USP team.

One of the main particularities of this context was that eDictor was used to correct the output of optical character recognition (OCR), and new edition categories had to be created.

Page 47: 2013 e dictor_a_chronology

PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. P. F. O Processamento automático de textos antigos: Desafios e Experiências. Workshop de Linguística de Corpus do Projeto Para a História do Português Brasileiro (PHPB), São Paulo, 2010.

Page 51: 2013 e dictor_a_chronology

One important consequence for eDictor was the possibility of adding new edition categories to the tool's Preferences file.

Page 52: 2013 e dictor_a_chronology

Some of these developments were presented at the VIII Encontro de Linguística de Corpus in 2009 by Pablo Faria; this presentation was published as a book chapter in 2010.

Page 53: 2013 e dictor_a_chronology

PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. E-dictor: Novas perspectivas na codificação e edição de corpora de textos históricos. In: VIII Encontro de Linguística de Corpus, 2009, Rio de Janeiro. 2009.

Page 54: 2013 e dictor_a_chronology

Interface of eDictor in 2009 – Edition Module

Page 55: 2013 e dictor_a_chronology

Example of changes after 1.0 beta 001: Edition Tab – “edition” became an open category

Page 56: 2013 e dictor_a_chronology

More importantly, researchers who worked with manuscript documents became interested in eDictor.

The special needs of this kind of material led to very important developments in the tool.

Page 57: 2013 e dictor_a_chronology

The first group of manuscript documents to be edited with the tool was the corpus of 19th-century letters from the PhD thesis of Zenaide Carneiro (2005), now part of the CEDOHS corpus.

The edition of this corpus in XML had been conceived at the time of the 2006 workshop in Salvador and, from the start, it brought to the development of eDictor the challenge of dealing with the particular categories and edition needs of manuscripts.

Page 58: 2013 e dictor_a_chronology

One important example of the developments brought about by the needs of manuscript editors is the facsimile view functionality.

It was developed by Pablo Faria after eDictor started to be used by the team at CEDOHS and by the team led by Célia Lopes at LaborHistórico, at UFRJ.

Page 59: 2013 e dictor_a_chronology

The CEDOHS corpus, with integrated facsimile view of manuscripts.

Page 61: 2013 e dictor_a_chronology

This new export format, hypertext with facsimile view, was integrated into later versions of eDictor and is currently used by other projects.

Page 62: 2013 e dictor_a_chronology

LaborHistórico – Laboratório para a História do Português Brasileiro, Universidade Federal do Rio de Janeiro. Coord. Célia Lopes

Workshop: “Edição Digital e Divulgação de Textos Antigos” (Digital Edition and Dissemination of Old Texts), Rio de Janeiro, February 3-5, 2010.

Page 63: 2013 e dictor_a_chronology

The corpus at LaborHistórico, with integrated facsimile view of manuscripts.

Page 65: 2013 e dictor_a_chronology

The workshops with the new teams of users, organized between 2010 and 2012, resulted in new builds of eDictor beta 1.0. Also, thanks to the growing number of users, in 2010 we finally got to make a manual...

Page 67: 2013 e dictor_a_chronology

First Version of eDictor’s Manual (2010)

(... actually, the only version so far)

Page 68: 2013 e dictor_a_chronology

As a result of this expansion, between 2009 and 2012 ten builds of eDictor beta 1.0 were made, reflecting the additions that were pointed out as necessary by the different user teams.

Page 69: 2013 e dictor_a_chronology

Two important publications were prepared during this period: a poster at the ACL meeting of 2010, presented by P. Faria, and the chapter for the book “Caminhos da Linguística de Corpus”.

In these papers we tried to cover the background to eDictor’s creation, the new developments, and the challenges ahead.

Page 70: 2013 e dictor_a_chronology

FARIA, P. P. F.; PAIXÃO DE SOUSA, M. C.; KEPLER, F. N. An Integrated Tool for Annotating Historical Corpora. The Fourth Linguistic Annotation Workshop (LAW IV) at The 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, 2010.

Page 71: 2013 e dictor_a_chronology

PAIXÃO DE SOUSA, M. C.; KEPLER, F. N.; FARIA, P. E-dictor: Novas perspectivas na codificação e edição de corpora de textos históricos. In: Tania Shepherd; Tony Berber Sardinha; Marcia Veirano Pinto. (Org.). Caminhos da linguística de corpus. Campinas: Mercado de Letras, 2010.

Page 73: 2013 e dictor_a_chronology

2013: and now, what?

Page 74: 2013 e dictor_a_chronology

eDictor 1.0 beta build 010 is the version currently in use. The main differences from beta 001 are the additions related to facsimile integration (in the transcription module and in the export functionality) and some bug fixes in the edition module.

But there are still bugs to be busted!

Page 75: 2013 e dictor_a_chronology

Interface of eDictor 1.0 beta b010

Page 77: 2013 e dictor_a_chronology

At the end of 2012, a new, web-based version of eDictor was conceived by Luiz Veronesi; it is currently under development.

Page 79: 2013 e dictor_a_chronology

Version 1.0 beta b010 of eDictor is currently being used by seven projects in Brazil and in Portugal.

Corpus Anotado do Português Tycho Brahe (Universidade Estadual de Campinas)

Grupo de Pesquisas Humanidades Digitais (Universidade de São Paulo)

Laboratório de História do Português Brasileiro (Universidade Federal do Rio de Janeiro)

P.S. – Projeto Arquivo Digital de Escrita Quotidiana em Portugal e Espanha na Época Moderna (Universidade de Lisboa)

Corpus Eletrônico de Documentos Históricos do Sertão, CEDOHS (Universidade Federal de Feira de Santana)

Memória Conquistense (Universidade Estadual do Sudoeste da Bahia)

Page 81: 2013 e dictor_a_chronology

There is still a lot to be done if we want to make eDictor a stable and fully transferable tool.

But of course...

Page 82: 2013 e dictor_a_chronology

The spirit of this tool has been one of growing into the users’ needs and requests. It will become a better tool if we work together on what we want it to be.

Page 84: 2013 e dictor_a_chronology

So we are very excited about this workshop!

Here’s one idea of how we could work:


Page 86: 2013 e dictor_a_chronology

We are launching today (09/09/2013) a new webpage for eDictor, at http://manualedictor.wordpress.com/.

We could use these days at the workshop to build more documentation and group it on the page.

Page 89: 2013 e dictor_a_chronology

That was it. Thank you!

Maria Clara Paixão de Sousa, Universidade de São Paulo

[email protected]
