Upload: leon-derczynski

Post on 14-Apr-2017


Efficient Named Entity Annotation
through Pre-empting

Leon Derczynski

Kalina Bontcheva

Crowdsourcing in science is not new

Citizen science: since about 1900, 60,000 to 80,000 volunteers yearly

Sir Francis Galton, VOX POPULI

Francis Galton

In fact, as early as 1907, Sir Francis Galton (Darwin's cousin and a brilliant Victorian scientist) published a Nature article entitled Vox Populi (the voice of the people, the voice of the crowd), in which he describes his experiment at a livestock fair: 787 people were asked to estimate the weight of an ox and, while none came close to the real value, the mean of the guesses was almost spot-on.

Meanwhile, other societies were using the crowd differently, namely to help them gather scientific data. Since about 1900, the Audubon Society has relied on volunteers to count local bird species. Their campaigns continue to this day: in 2012, volunteers submitted over 100,000 checklists, leading to observations of 623 species and over 17 million individual birds. These activities are often termed citizen science.

This is not a novel phenomenon

Citizen science projects have been around since the beginning of the last century (at least)

There is a vast landscape and variety of citizen science projects where scientists call on the public for help: some examples, including from Lora's paper (her talk might have some mentions as well)

IT enables virtual citizen science projects, and this upsurge is a direct consequence of new and improved ways to involve the public in scientific processes

Crowdsourcing as an effective paradigm

Researchers enjoy annotating

which makes it expensive

Many documents are inefficient to annotate

What is Crowdsourcing?

Crowdsourcing is an emerging collaborative approach for acquiring annotated corpora and a wide range of other linguistic resources

Three main kinds of crowdsourcing platforms

paid-for marketplaces such as Amazon Mechanical Turk (AMT) and CrowdFlower (CF)

games with a purpose

volunteer-based platforms such as Crowdcrafting

NLP researchers are increasingly using crowdsourcing as a novel, collaborative approach for obtaining linguistically annotated corpora

Example: CF Instructions

Example: CF Marking Locations in tweets

Example: CF Locations selected

Example 2: Entity Linking Annotation in CF

How to do it: The Easy Way

Download and use the GATE Crowdsourcing plugin

https://gate.ac.uk/wiki/crowdsourcing.html

Automatically transforms texts with GATE annotations into CF jobs

Generates the CF User Interface (based on templates)

Researcher then checks and runs the project in CF

On completion, the plugin automatically imports the results back into GATE, aligning sentences and representing the multiple annotators

GATE Crowdsourcing Overview (1)

Choose a job builder

Classification

Sequence Selection

Configure the corresponding user interface and provide the task instructions

GATE Crowdsourcing Overview (2)

Pre-process the corpus with TwitIE/ANNIE, e.g.

Tokenisation

POS tagging

Sentence splitting

NE recognition

Automatically create the target annotations and any dynamic values required for classification

Execute the job builder to upload units to CF automatically

Configure and execute the job in CF

Gold data units can also be uploaded from GATE, so that CF can control quality

Automatic CF Import into GATE

Each CF judgement is imported back as a separate annotation with some metadata

Adjudication can happen automatically (e.g. majority vote and/or trust-based) or manually (Annotation Stack editor)

The resulting corpus is ready to use for experiments or can be exported out of GATE as XML/XCES
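
The automatic adjudication step can be sketched in a few lines. This is only an illustration of majority and trust-weighted voting, not the GATE plugin's own code: the `adjudicate` function, the `(unit_id, worker_id, label)` tuple layout and the trust values are all assumptions made for the example.

```python
from collections import defaultdict


def adjudicate(judgements, trust=None):
    """Pick one label per unit by (optionally trust-weighted) majority vote.

    judgements: iterable of (unit_id, worker_id, label) tuples (hypothetical layout).
    trust: optional dict worker_id -> weight; workers not listed weigh 1.0.
    """
    trust = trust or {}
    tallies = defaultdict(lambda: defaultdict(float))
    for unit_id, worker_id, label in judgements:
        tallies[unit_id][label] += trust.get(worker_id, 1.0)

    result = {}
    for unit_id, votes in tallies.items():
        # Highest weight wins; ties are broken alphabetically so the
        # outcome is deterministic.
        result[unit_id] = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
    return result


# Toy judgements, invented for illustration.
judgements = [
    ("t1", "w1", "LOC"), ("t1", "w2", "LOC"), ("t1", "w3", "O"),
    ("t2", "w1", "O"), ("t2", "w2", "LOC"), ("t2", "w3", "LOC"),
]
plain = adjudicate(judgements)
weighted = adjudicate(judgements, trust={"w1": 0.9, "w2": 0.3, "w3": 0.4})
```

With equal weights both units come out as `LOC`; with the invented trust scores, the single high-trust worker outweighs the other two on `t2`.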

Side effect

Medium-size corpus, and a...

How can this cost be reduced?

Introduce determinism

Hypothesis: do entity-bearing sentences improve NER performance?

Features:

Character n-grams

Word shape n-grams

Token n-grams
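
A rough sketch of how the three feature families could be extracted from a sentence. The slides do not give the n-gram sizes, the tokeniser or the exact word-shape mapping, so the whitespace split, `n=2`, `char_n=3` and the `X`/`x`/`d` shape alphabet below are all assumptions.

```python
import re


def word_shape(token):
    """Map a token to a coarse shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def sentence_features(sentence, n=2, char_n=3):
    """Collect token, word-shape and character n-grams as string features."""
    tokens = sentence.split()  # assumption: whitespace tokenisation
    feats = set()
    for gram in ngrams(tokens, n):
        feats.add("tok=" + " ".join(gram))
    for gram in ngrams([word_shape(t) for t in tokens], n):
        feats.add("shape=" + " ".join(gram))
    for token in tokens:
        for gram in ngrams(token, char_n):
            feats.add("char=" + "".join(gram))
    return feats


feats = sentence_features("Leon visited Sheffield")
```

Each feature is prefixed with its family name so that the families can later be switched on and off independently.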











Pretty good!

Can we predict entities?

Baselines:

1. Random

2. All proper nouns = entities

Classifiers:

Maxent

SVM

Cost-weighted SVM
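
The two baselines are simple enough to sketch directly. The example assumes that the prediction unit is the sentence (does it contain an entity or not), that proper nouns carry Penn Treebank tags `NNP`/`NNPS`, and that the random baseline fires with probability 0.5; the tagged sentences and gold labels are invented toy data.

```python
import random


def proper_noun_baseline(tagged_sentence):
    """Baseline 2: a sentence bears an entity iff it contains a proper noun."""
    return any(tag in ("NNP", "NNPS") for _, tag in tagged_sentence)


def random_baseline(tagged_sentence, rng, p=0.5):
    """Baseline 1: guess 'entity-bearing' with probability p."""
    return rng.random() < p


def precision_recall(predictions, gold):
    """Precision and recall of the positive (entity-bearing) class."""
    tp = sum(1 for p, g in zip(predictions, gold) if p and g)
    fp = sum(1 for p, g in zip(predictions, gold) if p and not g)
    fn = sum(1 for p, g in zip(predictions, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for illustration: (token, POS tag) pairs per sentence.
sentences = [
    [("Leon", "NNP"), ("visited", "VBD"), ("Sheffield", "NNP")],
    [("it", "PRP"), ("rained", "VBD")],
    [("Happy", "NNP"), ("Monday", "NNP")],  # proper nouns, no entity in gold
    [("the", "DT"), ("university", "NN"), ("reopened", "VBD")],
]
gold = [True, False, False, False]

pn_predictions = [proper_noun_baseline(s) for s in sentences]
pn_precision, pn_recall = precision_recall(pn_predictions, gold)

rng = random.Random(0)  # seeded so the run is repeatable
random_predictions = [random_baseline(s, rng) for s in sentences]
```

On this toy set the proper-noun baseline finds the one entity-bearing sentence but also flags a sentence without entities, which is the kind of over-generation a trained classifier would be expected to reduce.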







Validating the results: again

Saving money through ML?

A bit too good to be true... or is it?

Compare hand-labelled pre-empted data with hand-labelled random data

Cross-lingual investigation

English is a bit boring

How about something else?

Germanic; non-Germanic; morphologically rich










Entity prediction universally great!

This looks good! But how about extrinsic results?

Does this help NER?

Building a corpus

Let's try building a corpus

Social media: high variation

Insufficient diversity in NLP researchers (KKTNY in 45min...)

Does our hypothesis apply in this text type?

Can we pre-empt in tweets as well?

Let's try and get greedy: can we do this per type?

Entity classifications tend to be arbitrary

Which features are useful?

Feature ablation
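
Feature ablation removes one feature group at a time, re-evaluates, and reports how much the score drops. The sketch below shows only that loop. The `evaluate` function is a toy stand-in (a vote among the active groups, scored by accuracy on four invented examples); a real run would retrain and test the classifier inside `evaluate`.

```python
def ablate(feature_groups, evaluate):
    """Return the full-model score and the score drop caused by removing each group."""
    all_groups = set(feature_groups)
    full = evaluate(all_groups)
    drops = {}
    for group in feature_groups:
        drops[group] = full - evaluate(all_groups - {group})
    return full, drops


# Toy data, invented for illustration: for each example, whether each
# feature group on its own votes "entity-bearing", plus the gold label.
EXAMPLES = [
    ({"char": True, "shape": True, "token": False}, True),
    ({"char": False, "shape": True, "token": False}, True),
    ({"char": False, "shape": False, "token": True}, False),
    ({"char": True, "shape": False, "token": False}, False),
]


def evaluate(active_groups):
    """Toy stand-in for train-and-test: strict majority vote of active groups."""
    correct = 0
    for votes, gold in EXAMPLES:
        yes = sum(1 for g in active_groups if votes[g])
        prediction = yes > len(active_groups) / 2
        correct += prediction == gold
    return correct / len(EXAMPLES)


full_score, drops = ablate(["char", "shape", "token"], evaluate)
```

A large drop means the removed group was carrying information the others could not replace; a drop near zero suggests the group is redundant on this data.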

Highest-weighted features

GWAP Use Case

Language Quiz
www.twitter.com/uCompEU

Master Thesis Tutorial; Karl Wber

Language Quiz

Provide an open API to enable partners to send new tasks to the game (or CrowdFlower)

The game supports various task types (at launch: multiple choice questions and sentiment detection)

Players receive points for correct answers in the game

The correct answers will be determined by majority vote, after enough answers have been collected

Each month the high scores are reset and a monthly winner is determined

Players are able to invite their friends, compete against them and receive bonus points through their activity
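
The deferred scoring rule (no answer counts as correct until enough players have responded) might look like the sketch below. The threshold of three answers, the one-point reward and the decision to wait on ties are assumptions for illustration; the slides do not specify them.

```python
from collections import Counter

MIN_ANSWERS = 3  # hypothetical threshold; the real value is not given


def resolve(answers, min_answers=MIN_ANSWERS):
    """Decide the correct answer by majority vote once enough answers exist.

    answers: list of (player, answer) pairs.
    Returns (correct_answer, points_by_player), or (None, {}) if the task
    is still open (too few answers, or a tie for first place).
    """
    if not answers or len(answers) < min_answers:
        return None, {}
    ranked = Counter(answer for _, answer in answers).most_common()
    top_answer, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, {}  # tie: wait for more answers
    points = {player: 1 for player, answer in answers if answer == top_answer}
    return top_answer, points


answers = [("ann", "positive"), ("bob", "positive"), ("cy", "negative")]
correct, points = resolve(answers)
```

Players who answered before the threshold was reached are credited retroactively, since points are computed from the full answer list at resolution time.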

Thank you for your time!

Leon Derczynski

Kalina Bontcheva


This was part of the uComp project (www.ucomp.eu). uComp receives the funding support of EPSRC EP/K017896/1, FWF 1097-N23, and ANR-12-CHRI-0003-03, in the framework of the CHIST-ERA ERA-NET.

University of Sheffield, NLP

www.ucomp.eu | www.chistera.eu @uCompEU

08/09/15