a collaborative, indeterministic and partly automatized approach to
TRANSCRIPT
A Collaborative, Indeterministic and PartlyAutomatized Approach to Text Annotation
Thomas Bögel1, Evelyn Gius2, Marco Petris2, Jannik Strötgen1
The heureCLÉA Project1Heidelberg University, 2University of Hamburg
http://www.heureclea.de/
July 8, 2014
A Collaborative, Indeterministic and PartlyAutomatized Approach to Text Annotation
Thomas Bögel1, Evelyn Gius2, Marco Petris2, Jannik Strötgen1
The heureCLÉA Project1Heidelberg University, 2University of Hamburg
http://www.heureclea.de/
July 8, 2014
Motivation NLP & UIMA UIMA-Catma
Motivation
Different types of annotations:some annotation tasks are complex, e.g.,
flashbacks, prolepsis, analepsis, ...
some annotation tasks are simple, e.g.,sentences, part-of-speech information, tenses, ...
Manual creation of simple annotationsslowtime-consumingboring
Spend time on complex tasks, add simple annotationsautomatically
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 1 / 8
Motivation NLP & UIMA UIMA-Catma
Motivation
Different types of annotations:some annotation tasks are complex, e.g.,
flashbacks, prolepsis, analepsis, ...some annotation tasks are simple, e.g.,
sentences, part-of-speech information, tenses, ...
Manual creation of simple annotationsslowtime-consumingboring
Spend time on complex tasks, add simple annotationsautomatically
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 1 / 8
Motivation NLP & UIMA UIMA-Catma
Motivation
Different types of annotations:some annotation tasks are complex, e.g.,
flashbacks, prolepsis, analepsis, ...some annotation tasks are simple, e.g.,
sentences, part-of-speech information, tenses, ...
Manual creation of simple annotationsslowtime-consumingboring
Spend time on complex tasks, add simple annotationsautomatically
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 1 / 8
Motivation NLP & UIMA UIMA-Catma
Motivation
Different types of annotations:some annotation tasks are complex, e.g.,
flashbacks, prolepsis, analepsis, ...some annotation tasks are simple, e.g.,
sentences, part-of-speech information, tenses, ...
Manual creation of simple annotationsslowtime-consumingboring
Spend time on complex tasks, add simple annotationsautomatically
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 1 / 8
Motivation NLP & UIMA UIMA-Catma
Natural Language Processing
Automatic processing of text datagoals (among others):
information extractionautomatic annotations
UIMA
UIMA: Unstructured Information Management Architecturecomponent framework for unstructured data (e.g., text)helps to connect tools not developed to be used together→ all components rely on the same data structure
(Common Analysis Structure, CAS)
UIMA programs are processing pipelines
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 2 / 8
Motivation NLP & UIMA UIMA-Catma
Natural Language Processing
Automatic processing of text datagoals (among others):
information extractionautomatic annotations
UIMA
UIMA: Unstructured Information Management Architecturecomponent framework for unstructured data (e.g., text)helps to connect tools not developed to be used together→ all components rely on the same data structure
(Common Analysis Structure, CAS)
UIMA programs are processing pipelines
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 2 / 8
Motivation NLP & UIMA UIMA-Catma
Natural Language Processing
Automatic processing of text datagoals (among others):
information extractionautomatic annotations
UIMA
UIMA: Unstructured Information Management Architecturecomponent framework for unstructured data (e.g., text)helps to connect tools not developed to be used together→ all components rely on the same data structure
(Common Analysis Structure, CAS)
UIMA programs are processing pipelines
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 2 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
UIMA pipelines contain three components
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
AnalysisEngines
CASConsumer
DocsResults
AnalysisEngines
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
AnalysisEngines
CASConsumer
DocsResults
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
AnalysisEngines
CASConsumer
DocsResults
Collection Readerreads documents from source (e.g., file system, database)
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
AnalysisEngines
CASConsumer
DocsResults
CAS
Collection Readerreads documents from source (e.g., file system, database)instantiates a CAS for each document
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
AnalysisEngines
CASConsumer
DocsResults
CAS doc text metadata
Collection Readerreads documents from source (e.g., file system, database)instantiates a CAS for each documentinitializes CAS with doc text (metadata, etc.)
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
CASConsumer
DocsResults
AnalysisEngines
AnalysisEngines
AnalysisEngines
AnalysisEngines
AnalysisEngines
CAS doc text metadata
Analysis Enginesusually several Analysis Enginesanalyze the document
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
CASConsumer
DocsResults
AnalysisEngines
AnalysisEngines
AnalysisEngines
AnalysisEngines
AnalysisEngines
CAS doc text metadata
Analysis Enginesusually several Analysis Enginesanalyze the documentread content of the CAS
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
CASConsumer
DocsResults
AnalysisEngines
AnalysisEngines
AnalysisEngines
AnalysisEngines
AnalysisEngines
CAS doc text metadata annotations
Analysis Enginesusually several Analysis Enginesanalyze the documentread content of the CASadd annotations to the CAS
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
CASConsumer
DocsResults
AnalysisEngines
CAS doc text metadata annotations
CAS Consumerreads content of the CAS
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
CASConsumer
DocsResults
AnalysisEngines
CAS doc text metadata annotations
CAS Consumerreads content of the CASdoes final processing
evaluation, visualization, indexing
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
CASConsumer
DocsResults
AnalysisEngines
CAS
UIMA - What’s the clue?
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
CASConsumer
DocsResults
AnalysisEngines
CAS
UIMA - What’s the clue?single components are not directly connected to each otherinstead: CAS object
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Components of a UIMA Pipeline
CollectionReader
CASConsumer
DocsResults
CAS
AnalysisEnginesAnalysisEngines
UIMA - What’s the clue?single components are not directly connected to each otherinstead: CAS objectcomponents are independent of each othercomponents only have to be able to handle CAS
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8
Motivation NLP & UIMA UIMA-Catma
Example Tasks
heureCLÉA project:sentence splittingtokenizationpart of speech taggingtemporal expressionstemporal signalstense
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 4 / 8
Motivation NLP & UIMA UIMA-Catma
Example Tasks
heureCLÉA project:sentence splittingtokenizationpart of speech taggingtemporal expressionstemporal signalstense
further projects:sentence splittingtokenizationpart of speech taggingtemporal expressionsgeographic expressionsnamed entities (persons)event extraction
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 4 / 8
Motivation NLP & UIMA UIMA-Catma
Example Tasks
heureCLÉA project:sentence splittingtokenizationpart of speech taggingtemporal expressionstemporal signalstense
further projects:sentence splittingtokenizationpart of speech taggingtemporal expressionsgeographic expressionsnamed entities (persons)event extraction
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 4 / 8
Motivation NLP & UIMA UIMA-Catma
The Temporal Tagger HeidelTime
Extraction and normalization of temporal expressionsJuly 8, 2014 → 2014-07-08today → 2014-07-08
Most of the work so far: English news documents
HeidelTime: Multilingual, Cross-domain Temporal Tagger
Current languages:English, Spanish, German, French, ItalianDutch, Arabic, Vietnamese, Chinese, Russian
Publicly available:UIMA & standalone versions, online demo@ Google Code
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 5 / 8
Motivation NLP & UIMA UIMA-Catma
The Temporal Tagger HeidelTime
Extraction and normalization of temporal expressionsJuly 8, 2014 → 2014-07-08today → 2014-07-08
Most of the work so far: English news documents
HeidelTime: Multilingual, Cross-domain Temporal Tagger
Current languages:English, Spanish, German, French, ItalianDutch, Arabic, Vietnamese, Chinese, Russian
Publicly available:UIMA & standalone versions, online demo@ Google Code
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 5 / 8
Motivation NLP & UIMA UIMA-Catma
The Temporal Tagger HeidelTime
Extraction and normalization of temporal expressionsJuly 8, 2014 → 2014-07-08today → 2014-07-08
Most of the work so far: English news documents
HeidelTime: Multilingual, Cross-domain Temporal Tagger
Current languages:English, Spanish, German, French, ItalianDutch, Arabic, Vietnamese, Chinese, Russian
Publicly available:UIMA & standalone versions, online demo@ Google Code
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 5 / 8
Motivation NLP & UIMA UIMA-Catma
The UIMA-Catma Pipeline
CollectionReader
AnalysisEngines
CASConsumer
DocsResults
AnalysisEngines
Collection Reader: reads documents from a sourceAnalysis Engines: add annotationsCAS Consumer: does final processing
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 6 / 8
Motivation NLP & UIMA UIMA-Catma
The UIMA-Catma Pipeline
CATMACollectionReader
CATMACAS
Consumer
AnalysisEngines
Collection Reader: reads document from CATMAAnalysis Engines: add annotationsCAS Consumer: sends annotations to CATMA
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 6 / 8
Motivation NLP & UIMA UIMA-Catma
Live Demo
So far: manual annotations of adverbs and superlatives→ nice annotation task to get familiar with CATMA→ boring if you want to annotate a whole book
Our CATMA-UIMA workflow:CATMA Collection Reader → get documents from CATMATreeTagger wrapper (part of UIMA HeidelTime kit)HeidelTime→ sentence splitting, tokenization, part-of-speech tagging→ temporal expressions
CATMA CAS Consumer: → send annotations to CATMA
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 7 / 8
Motivation NLP & UIMA UIMA-Catma
Live Demo
So far: manual annotations of adverbs and superlatives→ nice annotation task to get familiar with CATMA→ boring if you want to annotate a whole book
Our CATMA-UIMA workflow:CATMA Collection Reader → get documents from CATMATreeTagger wrapper (part of UIMA HeidelTime kit)HeidelTime→ sentence splitting, tokenization, part-of-speech tagging→ temporal expressions
CATMA CAS Consumer: → send annotations to CATMA
DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 7 / 8