Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems

Georg Rehm (1), Marina Santini (2), Alexander Mehler (3), Pavel Braslavski (4), Rüdiger Gleim (3), Andrea Stubbe (5), Svetlana Symonenko (6), Mirko Tavosanis (7), Vedrana Vidulin (8)

Language Resources and Evaluation Conference – LREC 2008

(1) University of Tübingen, Germany, SFB 441: Linguistic Data Structures
(2) KTH-Stockholm University, Sweden, DSV
(3) University of Bielefeld, Germany, Computational Linguistics Dept.
(4) Inst. of Engineering Science, RAS, Ekaterinburg, Russia
(5) conject AG, Munich, Germany
(6) Nitol, LLC, Moscow, Russia
(7) Università di Pisa, Italy, Dipartimento di Studi italianistici
(8) Jožef Stefan Institute, Ljubljana, Slovenia

Corresponding author: georg.rehm@uni-tuebingen.de




Introduction

Genres are specific types of text.

Genres have, roughly speaking, three characteristic properties:

- Content: topic

- Form: layout, design, text structure etc.

- Function: communicative purpose etc.

Genres are socially specified sets of rules and conventions.

Genres are recognised by particular discourse communities.

Genres usually have established names.


Examples of Traditional Genres


Scope of this Talk

There are not only hundreds (Dimter, 1981), but thousands (Adamzik, 1995) of genres:

- Shopping list

- Love letter

- Flyer

- Weather forecast

- CV

- PhD thesis

- …

This talk is not about traditional, paper-based genres.

This talk is about web genres.


Web Genres

Studies have shown that genres also exist on the web, e.g.:

- Personal homepage

- FAQ

- Blog

- Search engine

- Encyclopedia

- Web shop

Web genres are more complex than traditional genres:

- The web is a hypertext system

- Interactive features

- Multimedia


Automatic Web Genre Identification

If we were able to identify web genres automatically, we could exploit this information in search engines, for example to find:

- textbook web pages that contain “language resource”

- PhD thesis web pages that contain “RCG parsing”
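The search scenario above can be sketched as a genre filter applied on top of plain keyword matching. This is only an illustration: the `Page` record, the `search` function and the sample pages are hypothetical, and the genre labels are assumed to have been produced by some genre identification system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str    # hypothetical example URL
    genre: str  # label assumed to come from a genre identification system
    text: str


def search(pages, query, genre):
    """Return pages of the given genre whose text contains the query phrase."""
    q = query.lower()
    return [p for p in pages if p.genre == genre and q in p.text.lower()]


# Hypothetical sample index, for illustration only.
pages = [
    Page("http://example.org/a", "textbook", "An introduction to language resources."),
    Page("http://example.org/b", "blog", "My thoughts on language resources."),
    Page("http://example.org/c", "PhD thesis", "A chapter on RCG parsing."),
]

hits = search(pages, "language resource", "textbook")
print([p.url for p in hits])
```

The blog page matches the keywords but is filtered out by its genre, which is the benefit the slide describes.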

About 20 different approaches have been published in this area (incl. the identification of traditional genres). They mainly use

- Machine learning methods

- Hand-crafted genre detection rules


Automatic Web Genre Identification

All approaches have some characteristics in common.

Nearly every group of researchers

- has its own definition of “web genre”,

- creates its own document collection,

- creates its own set of web genre labels,

- annotates its corpus with these web genre labels.

Diagram: a web genre identification approach consists of a classification algorithm, a corpus (collection of web documents) and a tag set (genre categories); each of the three components is DIY.


Automatic Web Genre Identification

- Approach 1: Algorithm 1, Corpus 1, Tag set 1
- Approach 2: Algorithm 2, Corpus 2, Tag set 2
- Approach 3: Algorithm 3, Corpus 3, Tag set 3
- Approach 4: Algorithm 4, Corpus 4, Tag set 4
- Approach 5: Algorithm 5, Corpus 5, Tag set 5

It’s impossible to compare such isolated approaches.


Towards a Reference Corpus of Web Genres

- Approach 1: Algorithm 1
- Approach 2: Algorithm 2
- Approach 3: Algorithm 3
- Approach 4: Algorithm 4
- Approach 5: Algorithm 5

Reference Corpus of Web Genres: enables comparative evaluation
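What a shared reference corpus makes possible can be sketched as follows: every system is scored against the same documents and the same gold labels, so the resulting numbers are directly comparable. The two toy classifiers, the three-document corpus and its labels are made up for illustration, and accuracy is used only as a simple example metric.

```python
def accuracy(classify, corpus):
    """Fraction of documents whose predicted genre equals the gold label."""
    correct = sum(1 for text, gold in corpus if classify(text) == gold)
    return correct / len(corpus)


# Hypothetical shared corpus: (document text, gold genre label).
reference_corpus = [
    ("Q: How do I reset my password? A: Click 'forgot password'.", "FAQ"),
    ("Posted on Monday: today I went hiking.", "Blog"),
    ("Add to cart. Price: 20 EUR.", "Shop"),
]


# Two toy stand-ins for independently developed approaches.
def approach_1(text):
    return "FAQ" if "Q:" in text else "Blog"


def approach_2(text):
    if "cart" in text.lower():
        return "Shop"
    return "FAQ" if "?" in text else "Blog"


scores = {f.__name__: accuracy(f, reference_corpus) for f in (approach_1, approach_2)}
print(scores)
```

Because both functions see the same corpus and the same tag set, a difference in score reflects the algorithm rather than the data.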


Towards a Reference Corpus of Web Genres

- Approach 1: Algorithm 1
- Approach 2: Algorithm 2
- Approach 3: Algorithm 3
- Approach 4: Algorithm 4
- Approach 5: Algorithm 5

Shared by all approaches: a reference collection of web documents and a shared genre category set (or sets).


Assigning Genre Labels to Web Pages

The construction of a genre corpus involves the task of assigning genre labels to web documents by a group of annotators.

Previous studies have shown that this is a very hard task.

Diagram: annotators tag each web document with a genre category taken from a set of genre categories.


Preliminary Study

We conducted a survey amongst the group of authors:

- Goal: to measure the agreement of genre labels assigned to a random sample of 50 web documents by persons who are engaged in genre-related research.

- Seven of the nine authors participated.

Result: the categories assigned by the participants contain a very high number of disparate terms at various levels of abstraction.

Conclusion: the task of assigning genre labels to web documents is – even for linguists who work on genres – very hard.


Assigning Genre Labels to Web Pages

Consistency: High

- Participant 1: News article
- Participant 2: Article/commentary
- Participant 3: Article
- Participant 4: Feature
- Participant 5: A newsletter article
- Participant 6: News article
- Participant 7: Journalistic


Assigning Genre Labels to Web Pages

Consistency: Low

- P1: Entry page of the website of a research journal
- P2: Table of contents with snippets
- P3: Portal, link collection
- P4: Bibliography/List of Articles
- P5: A homepage of a subscription-based academic journal
- P6: Homepage
- P7: Index, Content Delivery
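One simple way to put a number on the two examples is the share of annotator pairs that chose exactly the same label. This is a sketch only: the slides do not say which agreement measure was used in the survey, and exact string matching (after lower-casing) is an assumption made here for illustration.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Share of annotator pairs that assigned exactly the same label string."""
    pairs = list(combinations(labels, 2))
    same = sum(1 for a, b in pairs if a.strip().lower() == b.strip().lower())
    return same / len(pairs)


# Labels as listed in the two examples.
high = ["News article", "Article/commentary", "Article", "Feature",
        "A newsletter article", "News article", "Journalistic"]
low = ["Entry page of the website of a research journal",
       "Table of contents with snippets", "Portal, link collection",
       "Bibliography/List of Articles",
       "A homepage of a subscription-based academic journal",
       "Homepage", "Index, Content Delivery"]

print(pairwise_agreement(high), pairwise_agreement(low))
```

Under exact matching only one of the 21 pairs agrees even in the "high consistency" example, which suggests why free-text labels are hard to compare and why a shared category set is needed.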


Genre Category Sets in Previous Approaches

Almost all category sets used in previous approaches

- are limited in size and scope and

- contain categories that cannot be considered genres:

Lim et al. (2005): Personal homepages; Public homepages; Commercial homepages; Bulletin collections; Link collections; Image collections; Simple tables/lists; Input pages; Journalistic materials; Research reports; Official materials; Informative materials; FAQs; Discussions; Product specifications; Others

Vidulin et al. (2007): Blog; Children's; Commercial/Promotional; Community; Content Delivery; Entertainment; Error Message; FAQ; Gateway; Index; Informative; Journalistic; Official; Personal; Poetry; Scientific; Shopping; User Input


Shared Genre Category Sets

A set of genre categories is needed so that we can assign web genre labels to web documents.

Requirements for this shared category set:

- It should be precise, scalable, as unambiguous as possible, and reflect the reality of genres as they occur on the web.

- The majority of researchers in this field should agree upon the category set or sets.

We used a wiki to come up with an initial proposal of 78 web genre categories.


Our Proposal for a Shared Genre Category Set

1. About Page
2. Abstract
3. Agenda (Schedule, Calendar)
4. Announcement
5. Application
6. Bibliography
7. Biography
8. Chronicle
9. Code Listings
10. Column / Editorial / Lead Article
11. Comic
12. Contact Form
13. Contract / Disclaimer / Terms and Conditions
14. Corporate Blog
15. Curriculum Vitae / CV / Resume
16. Data / Statistics / Data Sheet
17. Diary, Blog
18. Dictionary
19. Directory of Persons or Organisations
20. Discussion Group / Newsgroup
21. Download
22. Drama / Play
23. Encyclopedia
24. Errata
25. Error Message / Empty Page / Under Construction Page
26. Essay
27. Exercises (Problems)
28. FAQ
29. Feature Story / News Reportage
30. Game (Quiz, Puzzle)
31. Glossary
32. Guestbook
33. Homepage / Front Page / Entry Page
34. Horoscope
35. Index
36. Instruction
37. Interview
38. Invitation
39. Job Listing
40. Joke
41. Law / Regulation / Rule / Proclamation
42. Letter / Mail / E-Mail
43. Letter to the Editor
44. Linkfarm
45. Link Collection / Hotlist
46. List of Products
47. List of Projects
48. Login Page
49. Media (Images, videos, music, sound)
50. Meeting minutes
51. News Article
52. News Collection / Newsletter / Digest
53. Obituary
54. Official Report
55. Ordering Form / Booking Form
56. Pamphlet
57. Petition
58. Promotional / Advertisement
59. Poem / Poetry / Lyrics
60. Pornographic
61. Prose Fiction
62. Quotation
63. Reportage
64. Research Report
65. Review (Testimonial)
66. Script (Manuscript)
67. Search Form
68. Sermon
69. Shop
70. Specification
71. Speech
72. Splash Page / Gateway / Welcome Page
73. Strategic Plans
74. Survey
75. Table of contents / Sitemap / Navigation
76. Thesis
77. Travel Guide
78. Tutorial
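Many of the proposed categories bundle several names separated by slashes. A minimal sketch of how such entries could be turned into a lookup table, so that different names resolve to the same category, is given below. The function name and the choice to split on "/" are assumptions for illustration; the proposal itself does not prescribe a machine-readable format.

```python
def build_alias_index(categories):
    """Map each lower-cased alias to its full category name."""
    index = {}
    for category in categories:
        for alias in category.split("/"):
            index[alias.strip().lower()] = category
    return index


# A few entries taken from the proposed category set.
categories = [
    "Curriculum Vitae / CV / Resume",
    "Homepage / Front Page / Entry Page",
    "Link Collection / Hotlist",
    "FAQ",
]

index = build_alias_index(categories)
print(index["cv"])
print(index.get("entry page"))
```

With such an index, an annotator who writes "CV" and one who writes "Resume" end up with the same category.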


Tagging HTML Documents with Genre Categories

1) Tag HTML documents (the most common approach)

2) Tag websites

3) Tag page segments


Reference Collection of Web Documents

We plan to build the reference corpus in two stages:

- First, we will apply our shared set of genre categories to existing collections as a proof of concept. This is an initial step towards an objective evaluation and integrative compatibility of individual approaches.

- Second, we will use a crawler to gather more recent as well as more diverse sets of documents.
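The first stage amounts to relabelling an existing collection with the shared categories. A minimal sketch, under loud assumptions: the `LABEL_MAP` below is a hypothetical mapping invented for illustration (it is not taken from the talk), and labels without an obvious counterpart are set aside for manual review rather than guessed.

```python
# Hypothetical mapping from one collection's own labels to the shared set.
LABEL_MAP = {
    "editorial": "Column / Editorial / Lead Article",
    "biography": "Biography",
    "feature article": "Feature Story / News Reportage",
}


def relabel(documents, label_map):
    """Split documents into relabelled ones and ones needing manual review."""
    mapped, unmapped = [], []
    for doc_id, label in documents:
        shared = label_map.get(label.lower())
        if shared is None:
            unmapped.append((doc_id, label))
        else:
            mapped.append((doc_id, shared))
    return mapped, unmapped


# Made-up document IDs with labels in the style of an existing collection.
docs = [("d1", "Editorial"), ("d2", "Do-it-yourself guide"), ("d3", "Biography")]
mapped, unmapped = relabel(docs, LABEL_MAP)
print(mapped)
print(unmapped)
```

Keeping the unmapped list explicit shows where the shared category set and an existing collection do not line up.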


Reference Collection of Web Genres (Selection)

Web Corpus for English (Santini, 2007): editorial, biography, do-it-yourself guide, feature article (20 web pages each).

German corpus (Mehler et al., 2007, 2008): conference website (50 sites), personal academic homepage (68 sites), project website (52 sites), city website (180 sites).

Hierarchical Web Genre Collection (Stubbe and Ringlstetter, 2007): 32 genre classes, 40 HTML files per class, English.

Corpus of 400 blog posts, Italian (Tavosanis, 2007).

English (65,177 pages) and Russian (29,650 pages) corpora (Sharoff, 2007).


Corpus Management and Annotation Tools

Construction of the reference corpus requires tools that support

- compiling a document collection and

- annotating HTML documents.

We use the HyGraph toolbox:

- Supports researchers in the process of corpus compilation, annotation and analysis

- Annotate at various levels

- Assign confidence values

- Support for multiple tag sets and category systems

- Uses stand-off annotation
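The listed features (annotation levels, confidence values, multiple tag sets, stand-off storage) can be pictured with a small record type. This is not HyGraph's actual data model, which the slides do not describe; all field names, the tag set name "shared-78" and the example URLs are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Annotation:
    target: str        # URL or ID of the annotated unit, kept apart from the HTML
    level: str         # "document", "website" or "segment"
    tag_set: str       # name of the category system the label belongs to
    label: str
    confidence: float  # annotator's confidence, assumed to lie in [0, 1]
    annotator: str


annotations = [
    Annotation("http://example.org/faq.html", "document", "shared-78", "FAQ", 0.9, "A1"),
    Annotation("http://example.org/faq.html", "document", "shared-78", "Instruction", 0.4, "A2"),
]

# Stand-off: the HTML file is never modified; annotations live in their own JSON document.
standoff = json.dumps([asdict(a) for a in annotations], indent=2)
print(standoff)
```

Storing annotations separately means the same page can carry labels from several annotators and several tag sets without touching the original markup.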


Towards a Reference Corpus of Web Genres

Diagram: the reference collection of web documents and the shared genre category set (or sets) together form the Reference Corpus of Web Genres.


Summary and Future Work

We are constructing a reference corpus of web genres.

The goal is to provide a shared resource for researchers who work on web genre identification and on the evaluation of these systems.

Future work includes the further realisation of this resource:

- Apply a set of genre categories to existing corpora.

- Collect a large set of new documents that will be categorised based on annotation guidelines using HyGraph.

- Assign genre labels to single web documents first and to page segments as well as complete websites later.


Q/A

Thanks for your attention!

Please get in touch if you (plan to) work in the field of automatic web genre identification or a related area:

georg.rehm@uni-tuebingen.de

http://129.70.40.20/WebGenreWiki/

A mailing list will be available soon.