
Genre as Noise - Noise in Genre Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz CIS: University of Munich AICML: University of Alberta


Post on 27-Mar-2015







Genre as Noise -

Noise in Genre

Andrea Stubbe, Christoph Ringlstetter, Klaus U. Schulz

CIS: University of Munich

AICML: University of Alberta



• For search applications, we often want to narrow the result set to a certain class of documents

• For corpus construction, excluding certain document classes can be helpful

• Documents with a high error rate can harm applications such as computer-aided language learning (CALL) or lexicon construction, and documents of certain classes may be more erroneous than others

It therefore makes sense to investigate the implications of document genre for noise reduction


Definition of Genre

• Partition of documents into distinct classes of text with similar function and form

• Independent dimension, ideally orthogonal to topic

• Examples of document genres: blogs, guestbooks, science reports

• Mixed documents are possible, i.e. documents whose parts belong to different genres


Two different views on Genre

• Genre as noise: a document of the wrong genre will often be noise

• Noise in genre: documents of different genres contain different amounts of noise




• Introduction of a new genre hierarchy

• Macro-Noise detection

– Feature Space

– Classifiers

– Experiments and applications

• Micro-Noise detection

– Error dictionaries

– Experiments on correlation of genre and noise

– Experiments on classification by noise


A hierarchy of Genres

Requirements for a genre classification schema:

• Task-oriented granularity

• Hierarchical

• Logically consistent

• Complete


A hierarchy of Genres

8 container classes with 32 leaf genres



Container Classes

• Allow comparison with other classification schemas

• Allow the seriousness of classification errors to be evaluated

Training and Evaluation Corpus

• For each of the 32 genres, 20 English HTML web documents were collected for training and 20 for testing, yielding a corpus of 1,280 files.


Detection of Macro-Noise

Macro-Noise detection is a classification problem

• Candidate Features

• Feature selection mechanism

• Build Classifiers

• Combine Classifiers for Classification


Feature Space

Examples of features

• Form: line length, number of sentences

• Vocabulary: specialized word lists, dictionaries, multi-lexemic expressions

• Structure: POS

• Complex patterns: style

Altogether we obtained over 200 features for the 32 genres


Feature Space

Core question: selection of features

• Global feature sets for the standard machine learning algorithms

• Specialized feature sets for our specialized classifiers

Small set of significant and natural features for each genre

Avoiding accidental similarities between documents


Feature Space

Feature Selection for specialized genre classifiers


repeat

    select candidate feature

    add feature if classification performance improves

    order features by classification strength

    prune features that have become obsolete

until Recall > 90/75% && Precision > 90/75%

Rules: Constructed as inequalities with discriminative ranges

Classifiers: Conjunction of single rules
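The selection loop and the rule format above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature names, the sample values, the way a discriminative range is derived (minimum and maximum over the positive training documents), the use of F1 as the performance measure, and the single 0.9 stopping threshold are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, float]            # feature name -> feature value
Rule = Tuple[str, float, float]   # (feature, low, high) meaning low <= value <= high


def make_rule(feature: str, positives: List[Doc]) -> Rule:
    """Assumed range construction: span of the feature over positive documents."""
    values = [doc[feature] for doc in positives]
    return (feature, min(values), max(values))


def matches(rules: List[Rule], doc: Doc) -> bool:
    """A classifier is the conjunction of its single rules."""
    return all(low <= doc[feature] <= high for feature, low, high in rules)


def precision_recall(rules: List[Rule], positives: List[Doc],
                     negatives: List[Doc]) -> Tuple[float, float]:
    tp = sum(matches(rules, doc) for doc in positives)
    fp = sum(matches(rules, doc) for doc in negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(positives) if positives else 0.0
    return precision, recall


def f1(rules: List[Rule], positives: List[Doc], negatives: List[Doc]) -> float:
    p, r = precision_recall(rules, positives, negatives)
    return 2 * p * r / (p + r) if p + r else 0.0


def select_features(candidates: List[str], positives: List[Doc],
                    negatives: List[Doc], target: float = 0.9) -> List[Rule]:
    rules: List[Rule] = []
    for feature in candidates:
        trial = rules + [make_rule(feature, positives)]
        # Add the feature only if classification performance improves.
        if f1(trial, positives, negatives) > f1(rules, positives, negatives):
            rules = trial
            # Prune older rules that the new rule has made obsolete.
            for old in list(rules[:-1]):
                reduced = [r for r in rules if r != old]
                if f1(reduced, positives, negatives) >= f1(rules, positives, negatives):
                    rules = reduced
        p, r = precision_recall(rules, positives, negatives)
        if p > target and r > target:
            break
    return rules


# Hypothetical training data for one genre (positives) against all others (negatives).
positives = [{"line_len": 80, "quotes": 0.20},
             {"line_len": 90, "quotes": 0.30},
             {"line_len": 85, "quotes": 0.25}]
negatives = [{"line_len": 85, "quotes": 0.01},
             {"line_len": 30, "quotes": 0.22},
             {"line_len": 40, "quotes": 0.00}]

rules = select_features(["line_len", "quotes"], positives, negatives)
print(rules)
```

In this toy data neither feature separates the genre on its own, so both rules are kept and the resulting conjunction rejects all three negative documents.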



Example: Classifier for reportage as a conjunction of single rules



Classifier Combination

• Filtering: when a document receives multiple classifications, one class can serve as a disqualification criterion for another class

• Ordering by F1 value: classifiers that are more likely to yield a correct classification are applied first

• Ordering by dependencies and recall: a graph whose edges represent the number of wrong classifications of one class as another controls the sequence of classifier application. Edges with smaller values are traversed first, leading to fewer wrong classifications
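The first two combination strategies can be sketched together in a few lines. This is an illustrative assumption about how they might interact, not the authors' code: the genre names, predicates, F1 values, and the disqualification table are hypothetical, and the graph-based ordering is not covered here.

```python
from typing import Callable, Dict, Optional, Set

Doc = Dict[str, float]


def classify(doc: Doc,
             classifiers: Dict[str, Callable[[Doc], bool]],
             f1_scores: Dict[str, float],
             disqualifies: Dict[str, Set[str]]) -> Optional[str]:
    """Return one genre label, or None if no classifier accepts the document."""
    matched = [genre for genre, accepts in classifiers.items() if accepts(doc)]

    # Filtering: a matched class can rule out other matched classes.
    excluded: Set[str] = set()
    for genre in matched:
        excluded |= disqualifies.get(genre, set())
    remaining = [genre for genre in matched if genre not in excluded]

    # Ordering by F1: prefer the classifier that is most likely to be correct.
    remaining.sort(key=lambda genre: f1_scores[genre], reverse=True)
    return remaining[0] if remaining else None


# Hypothetical classifiers, F1 values, and disqualification table.
classifiers = {
    "blog": lambda d: d["first_person"] > 0.05,
    "forum": lambda d: d["quotes"] > 0.10,
    "news": lambda d: d["line_len"] > 60,
}
f1_scores = {"blog": 0.6, "forum": 0.8, "news": 0.7}
disqualifies = {"news": {"blog"}}

doc_a = {"first_person": 0.10, "quotes": 0.00, "line_len": 70}
doc_b = {"first_person": 0.10, "quotes": 0.20, "line_len": 20}
doc_c = {"first_person": 0.00, "quotes": 0.00, "line_len": 20}

print(classify(doc_a, classifiers, f1_scores, disqualifies))
print(classify(doc_b, classifiers, f1_scores, disqualifies))
print(classify(doc_c, classifiers, f1_scores, disqualifies))
```

Picking the highest-F1 match among the remaining classes gives the same label as applying the classifiers in descending F1 order and stopping at the first hit.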


Experiments on Macro-Noise

Detection of Genre:

• On the test corpus, the specialized classifiers reach an overall precision of 72.2% and an overall recall of 54.0%

• This is superior to standard machine learning methods; the best of them, SVM, reaches 51.9% precision and 47.8% recall

• The superiority can be claimed only for small training corpora

• Work on incremental classifier improvement and on the behavior with bigger training sets is forthcoming
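To compare the two precision/recall pairs with a single number, one option is the F1 value already used for classifier ordering. The slides do not report F1 for this comparison and do not say how the overall values were averaged, so the sketch below simply applies the harmonic mean to the reported overall figures as an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall as reported on the slide.
specialized_f1 = f1_score(0.722, 0.540)
svm_f1 = f1_score(0.519, 0.478)

print(f"specialized classifiers: {specialized_f1:.3f}")
print(f"SVM: {svm_f1:.3f}")
```

Under this assumption the specialized classifiers come out at roughly 0.62 and the SVM at roughly 0.50.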


Experiments on Macro-Noise

Application 1: Retrieving Scientific Articles on fish

• Queries like (cod ∧ habitat) are sent to a search engine to retrieve scientific documents

• Evaluation over the 30 top-ranked documents of a query

• Precision and recall at cut-off points of 5, 10, 15, and 20 documents could be significantly improved by genre recognition, leaving room for further improvement
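The evaluation at cut-off points can be illustrated with a small sketch. The ranked list, the relevance judgments, and the predicted genre labels below are invented for the example; only the idea of filtering a result list by recognized genre before measuring precision and recall at a cut-off comes from the slide.

```python
from typing import List, Tuple

# (document id, relevant to the information need?, predicted genre)
Result = Tuple[str, bool, str]


def precision_at_k(results: List[Result], k: int) -> float:
    top = results[:k]
    return sum(relevant for _, relevant, _ in top) / len(top) if top else 0.0


def recall_at_k(results: List[Result], k: int, total_relevant: int) -> float:
    top = results[:k]
    return sum(relevant for _, relevant, _ in top) / total_relevant if total_relevant else 0.0


def genre_filter(results: List[Result], wanted_genre: str) -> List[Result]:
    """Keep only results whose predicted genre is the wanted one, preserving rank order."""
    return [r for r in results if r[2] == wanted_genre]


# Hypothetical ranked result list for one query.
ranked = [
    ("d1", False, "shop"), ("d2", True, "science"), ("d3", False, "blog"),
    ("d4", True, "science"), ("d5", True, "science"), ("d6", False, "science"),
    ("d7", True, "science"), ("d8", False, "forum"), ("d9", True, "science"),
    ("d10", False, "shop"),
]
total_relevant = sum(relevant for _, relevant, _ in ranked)
filtered = genre_filter(ranked, "science")

print(precision_at_k(ranked, 5), precision_at_k(filtered, 5))
print(recall_at_k(ranked, 5, total_relevant), recall_at_k(filtered, 5, total_relevant))
```

In this toy list, removing documents of unwanted genres moves relevant documents up the ranking, so both measures rise at the same cut-off.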


Experiments on Macro-Noise

Application 2: Language models for speech recognition

• Language models of speech corpora are notoriously sparse

• The standard solution, augmentation with text documents, should improve when genres similar to spoken text are chosen, such as forum, interview, and blog

• The noise in a crawled corpus of ~30,000 documents could be reduced to a residue of 2.5%


Detection of Micro-Noise

Examples of Micro-Noise: typing errors, cognitive errors

• Method: detection of errors with specialized error dictionaries


Error Dictionaries

Construction principle: Micro-Noise arises from channel characteristics that can be identified. These characteristics can be discovered analytically or by observation in a training corpus.

• Transition rules:

R_i := lαr → lβr, with l, α, β, r as character sequences

• These rules are applied to a vocabulary base that should represent the documents to be processed. Productivity depends on the context l, r.

• We get a raw error dictionary D_err-raw with entries

[error token | original token | character transition(s)]
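The construction step above can be sketched as follows. The two rules, the tiny vocabulary, and the string format used to record a transition are hypothetical; the sketch only shows how a rule of the form lαr → lβr could be applied at every matching position of a word to produce raw entries.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    left: str    # context l
    alpha: str   # correct character sequence
    beta: str    # erroneous character sequence
    right: str   # context r


# (error token, original token, character transition)
Entry = Tuple[str, str, str]


def apply_rule(rule: Rule, word: str) -> List[Entry]:
    """Rewrite every occurrence of l+alpha+r in word as l+beta+r, one at a time."""
    source = rule.left + rule.alpha + rule.right
    target = rule.left + rule.beta + rule.right
    entries: List[Entry] = []
    start = word.find(source)
    while start != -1:
        error = word[:start] + target + word[start + len(source):]
        if error != word:
            entries.append((error, word, f"{rule.alpha}->{rule.beta}"))
        start = word.find(source, start + 1)
    return entries


def build_raw_dictionary(rules: List[Rule], vocabulary: List[str]) -> List[Entry]:
    raw: List[Entry] = []
    for word in vocabulary:
        for rule in rules:
            raw.extend(apply_rule(rule, word))
    return raw


# Hypothetical rules: a context-free transposition and a context-bound deletion.
rules = [
    Rule(left="", alpha="ie", beta="ei", right=""),
    Rule(left="f", alpha="o", beta="", right="r"),
]
vocabulary = ["field", "believe", "form"]

raw_dictionary = build_raw_dictionary(rules, vocabulary)
for entry in raw_dictionary:
    print(entry)
```

The second rule fires only where "o" sits between "f" and "r", which is one way to read the remark that productivity depends on the context.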


Error Dictionaries

Filtering Step:

• The raw error dictionary D_err-raw is filtered against a collection of relevant positive dictionaries, leading to two error dictionaries:

– D_err: non-word errors

– D_err-ff: word errors, false friends
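One plausible reading of the filtering step is sketched below: a raw entry whose error token is itself a valid word in some positive dictionary goes to D_err-ff, and every other entry goes to D_err. The slide does not spell out the exact procedure, so this split, the sample entries, and the sample dictionaries are assumptions.

```python
from typing import Iterable, List, Set, Tuple

Entry = Tuple[str, str, str]  # (error token, original token, character transition)


def filter_raw_dictionary(raw: List[Entry],
                          positive_dictionaries: Iterable[Set[str]]
                          ) -> Tuple[List[Entry], List[Entry]]:
    """Split raw entries into non-word errors and false friends."""
    known_words: Set[str] = set().union(*positive_dictionaries)
    d_err: List[Entry] = []
    d_err_ff: List[Entry] = []
    for entry in raw:
        error_token = entry[0]
        if error_token in known_words:
            d_err_ff.append(entry)   # the error token is a real word
        else:
            d_err.append(entry)      # the error token is not a word
    return d_err, d_err_ff


# Hypothetical raw entries and positive dictionaries.
raw = [
    ("feild", "field", "ie->ei"),
    ("from", "form", "or->ro"),
    ("frm", "form", "o->"),
]
positive_dictionaries = [{"field", "form", "from"}, {"believe"}]

d_err, d_err_ff = filter_raw_dictionary(raw, positive_dictionaries)
print(d_err)
print(d_err_ff)
```

Keeping the false friends in a separate dictionary matters because a token such as "from" is usually correct and cannot be flagged by lookup alone.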


Error Dictionaries

Usage of error dictionaries:

• From a base of 100,000 English words, we obtained a filtered error dictionary for typing errors with 9,427,051 entries

• For cognitive errors, we obtained a lexicon with 1,202,997 entries

• Recall of 60% and precision of 85% on a reference corpus

• Error detection: scan the text with the error dictionary and compute the mean error rate per 1,000 tokens
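The scan described in the last bullet can be sketched as a dictionary lookup per token. The tokenizer (lowercased runs of ASCII letters), the toy error dictionary, and the sample documents are assumptions for illustration; a real error dictionary would hold millions of entries.

```python
import re
from statistics import mean
from typing import List, Set


def error_rate_per_1000(text: str, error_dictionary: Set[str]) -> float:
    """Errors found by dictionary lookup, normalized to 1,000 tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    errors = sum(1 for token in tokens if token in error_dictionary)
    return 1000.0 * errors / len(tokens)


def mean_error_rate(documents: List[str], error_dictionary: Set[str]) -> float:
    """Mean of the per-document error rates, e.g. over all documents of one genre."""
    return mean(error_rate_per_1000(doc, error_dictionary) for doc in documents)


# Hypothetical error dictionary and documents.
error_dictionary = {"beleive", "feild", "frm"}
documents = [
    "We beleive that the field is very green",
    "The field is green",
]

rates = [error_rate_per_1000(doc, error_dictionary) for doc in documents]
print(rates)
print(mean_error_rate(documents, error_dictionary))
```

Averaging per-document rates gives every document equal weight regardless of length; pooling all tokens of a genre would be an alternative design.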


Experiments on Micro-Noise

Correlation of error rate and genre:

• For each genre in the genre corpus we computed the errors per 1,000 tokens with the help of the two error dictionaries

• We found a strong correlation between genre and mean error rate

• Extreme values are legal texts with 0.23 errors per 1,000 tokens and guestbooks with 6.23 errors per 1,000 tokens


Experiments on Micro-Noise

Stability of the values: the training and test corpora yield similar plots


Experiments on Micro-Noise

Preliminary experiments on using Micro-Noise for classification:

• Extending the specialized genre classifiers with a filter based on the mean error rate: precision improved for 5 genres, but 1 classifier lost performance and recall was lower for 3 genres

• SVM classifier with the mean error rate as a new feature: equivocal results as well, with improvements for some of the genres

• Problem: the error rate has a high variance, and error-free documents occur even in genres with a high mean error rate
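A filter of this kind, and the variance problem it runs into, can be illustrated with a small sketch. The slides do not describe how the filter is defined, so the per-genre tolerance bands below are purely hypothetical, as are the sample error rates.

```python
from typing import Dict, Optional, Tuple


def apply_error_rate_filter(predicted_genre: Optional[str],
                            doc_error_rate: float,
                            allowed_ranges: Dict[str, Tuple[float, float]]
                            ) -> Optional[str]:
    """Keep a genre prediction only if the document's error rate fits the genre's band."""
    if predicted_genre is None:
        return None
    low, high = allowed_ranges.get(predicted_genre, (0.0, float("inf")))
    return predicted_genre if low <= doc_error_rate <= high else None


# Hypothetical bands in errors per 1,000 tokens.
allowed_ranges = {"legal": (0.0, 1.0), "guestbook": (2.0, 12.0)}

# A noisy document wrongly predicted as legal text is rejected (helps precision).
print(apply_error_rate_filter("legal", 5.0, allowed_ranges))
# A clean legal text passes.
print(apply_error_rate_filter("legal", 0.2, allowed_ranges))
# An error-free guestbook is rejected as well (hurts recall).
print(apply_error_rate_filter("guestbook", 0.0, allowed_ranges))
```

The last call shows why a high variance of the error rate within a genre limits such a filter: a correct prediction is thrown away because the individual document happens to be clean.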


Conclusions

• For certain applications, the genre dimension partitions document repositories into noise and wanted documents

• We introduced a new genre hierarchy that allows informed corpus construction

• Our easy-to-implement specialized classifiers reach competitive results for genre recognition, even with small training corpora

• Error dictionaries can be used to estimate the mean error rates of documents

• We found a strong correlation between genre and error rate

• Classification by noise leads to equivocal results


Future Work

• We will try to convince other researchers to build up a corpus with at least 1,000 documents per genre

• We are working on an incremental learning algorithm that improves our classifiers based on user click behavior

• The correlation of genre and error rates will be further investigated on a bigger genre corpus with an exhaustive statistical analysis

• Regarding the effects of errors on IR applications, the repair potential of error dictionaries will be investigated


Thank you for your attention!