practical nlp and information extraction with gate · practical nlp and information extraction with...

Post on 17-Jul-2020

18 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

PracticalNLPandInformationExtractionwithGATE

Dr. DianaMaynardUniversityofSheffield,UK

Whatistextmining?

• TextMiningisthediscoveryofnew,previouslyunknowninformation,byautomaticallyextractinginformationfromdifferenttextualresources.

• Akeyelementisthelinkingtogetheroftheextractedinformationtogethertoformnewfactsornewhypothesestobeexploredfurther

• Textminingletsyouinvestigatewhat’sactuallyinadocumentoracollectionofdocuments

• Itletsusanswerwh-questions:who,what,why,when,how,where,which?

TextMiningisnotDataMining

Examples:• using consumer purchasing

patterns to predict which products to place close together on shelves in supermarkets

• analysing spending patterns on credit cards to detect fraudulent card use.

• Data mining is about using analytical techniques to find interesting patterns from large structured databases

TextMiningisnotWebSearch

• Text mining is also different from traditional web search.• In search, the user is typically looking for something that is

already known and has been written by someone else.• The problem lies in sifting through all the material that currently

isn't relevant to your needs, in order to find the information that is.• The solution often lies in better ways to ask the right question• You can't ask Google to tell you:

• How does the language used by Donald Trump differ from the language used by Hilary Clinton?

• In which parts of the country did people talk more about the environment during the UK elections?

• Which female MPs talked in the last 6 months about British hospitals with more than 100 deaths per month since 2010?

Basictextprocessing

His MMSE was 23/30 on 15 January 2008.

(MMSE=MiniMentalStateExamination)

Characteroffsets

His MMSE was 23/30 on 15 January 2008.

0....5....10...15...|....|....|....|....

Sentences

His MMSE was 23/30 on 15 January 2008.

0....5....10...15...|....|....|....|....

Tokens

His MMSE was 23/30 on 15 January 2008.

0....5....10...15...|....|....|....|....

Partofspeechcategories

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

PP NN VB CD CD PR CD NN CD

MorphologicalAnalysis

VB

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

PP NN CD CD PR CD NN CDbe

Knowledgeengineering:findingpatterns

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

Month

PP NN VB CD CD PR CD NN CDbe

Knowledgeengineering

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

MMSE Month

PP NN VB CD CD PR CD NN CDbe

Knowledgeengineering

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

MMSE Month

PP NN VB CD CD PR CD NN CDbe

{number}{Month}{number}

Knowledgeengineering

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

MMSE Month

PP NN VB CD CD PR CD NN CDbe

Date

Knowledgeengineering

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

MMSE Month

PP NN VB CD CD PR CD NN CDbe

{number}{slash}{number} Date

Knowledgeengineering

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

MMSE Month

Score Date

PP NN VB CD CD PR CD NN CDbe

Knowledgeengineering

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

MMSE Month

Score Date

PP NN VB CD CD PR CD NN CDbe

{MMSE}{BE}{Score}{?}{Date}

Knowledgeengineering

MMSE Month

Score Date

MMSE with score and date

His MMSE was 23/30 on 15 January 2008.0....5....10...15...|....|....|....|....

PP NN VB CD CD PR CD NN CDbe

Typical Information Extraction pipeline

• Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)

• Entity finding (gazetteer lookup, NE grammars)• Co-reference (alias finding, orthographic co-

reference etc.)• Export the results somewhere (database / XML

/ ontology)

Example of Information Extraction

John lives in London . He works there for Polar Bear Design .

Basic Named Entity Recognition

John lives in London . He works there for Polar Bear Design .

PER LOC ORG

same_as

John lives in London . He works there for Polar Bear Design .

Co-reference

PER LOC ORG

John lives in London . He works there for Polar Bear Design .

Relations

PER LOC ORG

live_in

John lives in London . He works there for Polar Bear Design .

Relations (2)

PER LOC ORG

employee_of

John lives in London . He works there for Polar Bear Design .

Relations (3)

PER LOC ORG

based_in

WhatisEventRecognition?

l Aneventisanactionorsituationrelevanttothedomainexpressedbysomerelationbetweenentitiesorterms.

l Itisalwaysgroundedintime,e.g.theperformanceofaband,anelection,thedeathofaperson

Mitt Romney, the favorite to win the Republican nomination for president in 2012

Event DatePerson

Relation Relation

WhyareEntitiesandEventsUseful?

l Theycanhelpanswerthe“Big5”journalismquestions(who,what,when,where,why)

l Theycanbeusedtocategorisethetextsindifferentwaysl lookatalltextsaboutDonaldTrumpl Theycanbeusedastargetsforopinionminingl findoutwhatpeoplethinkaboutDonaldTrump

l Whenlinkedtoanontologyand/orcombinedwithotherinformation,theycanbeusedforreasoningaboutthingsnotexplicitinthetextl seeinghowopinionsaboutdifferentAmericanpresidents

havechangedovertheyears

GATE: General Architecture for Text

Engineering

WhyGATE?

• GATEisthemostwidelyusedopensourcetoolkitforNLPintheworld• We’reusingitbecauseit’sagreatwaytoshowcaseallthecoreNLP

componentsthatareusedfortextanalysistasks• YoucanplaywithallthetoolsinGATEandtryoutthingsforyourselftosee

howitworks• Experts

• DevelopedattheUniversityofSheffieldsince2000(initscurrentform)• ThepersonwhohasledthedevelopmentoftheNLPtoolsinGATEsince2000

istheonepresentingtoyounowJ

• Andbytheway,justbecauseit’solddoesn’tmeanit’soutofdate.GATEisinconstantdevelopmentwithnewtechnologiesbeingconstantlyadded.

Aboutthistutorial

• ThistutorialwillgetyoustartedwiththeGATEgraphicaluserinterface(GUI),alsoknownas“GATEDeveloper”

• EverythingyoudointheGUIcanbedoneviatheAPI,butit’seasiertoseewhat’sgoingonintheGUI

• Itwillbeahands-onsession.YoucantrythingsoutinGATEasthetopicsarepresented.

• Thingssuggestedforyoutotryyourselfarein red.• DownloadandinstallGATE8.5.1(ifyouhaven’talready)from

http://gate.ac.uk/download• StartGATEonyourcomputer(ifyouhaven'talready)bydoubleclickingthe

icon

GATEGUI

Resources Pane

Menu Bar

Shortcut Buttons

ResourceFeatures

Messages

DisplayPane

Resources

• MostthingsyouusewithinGATEare“resources”:• Languageresources (LRs)aredocuments,documentcollections,

ontologies...• Acollectionofdocumentsisknownasacorpus

• Processingresources (PRs)areprogramsthatoperateontextwithinthedocuments,andoftencreateormodifyannotations

• Datastores areforstoringdocumentsandcorporaforlateruse• Applications (“pipelines”)aresequencesofprocessingresourcesthat

runononeormoredocuments

DisplayingResources

• WhenyoufirstopenGATE,thedisplaypanewillshowmessagesfromthesysteminthe“Messages”tab

• Thedisplaypanedisplayswhateverelementsyouarecurrentlyworkingwith,e.g.anapplication,adocumentoraprocessingresource,eachinitsowntab

• Doubleclickingonaresourceintheresourcespanewilldisplayit• Tabsalongthetopofthedisplaypaneallowyoutochoosewhichof

theopenresourcestodisplay

CreateNewDocument

• FromtheResourcesPane,rightclick“LanguageResources”→New→GATEDocument

• Ignoretheparametersettingsthatwillbedisplayed• ClickOK• “GATEDocument_<id>”willnowbeaddedto“LanguageResources”• Doubleclickthatdocumentname• Atabisopenedinthedisplaypane,showingtheemptydocument.• Nowentersometextthere.

EmptyDocument

DocumentTab

DocumentEditor

DocumentName

DocumentEditor Buttons

DocumentResource Views

DocumentEditor

• TheDocumentEditorisshownasanewTabintheDisplayPane,alongsidetheMessagePane

• TherearebuttonsonthetopoftheEditor,e.g.“AnnotationSets”–wewilllearnaboutthemlater.

• TherearetabsatthebottomoftheDocumentTab:theseshowdifferent“Views”ofthedocument.

• Thesmallpaneinthelowerleftshowsthe“documentfeatures”(optionalinformationassociatedwiththedocumentresourceaskey/valuepairs)

Simpleoperationsonresources

• Rightclickingonthenameofaresourceintheresourcepanegivesaccesstoamenuofactions

• Doubleclickingonthenameofaresourceopensaviewoftheresourceinthedisplaypane(tripleclickingthenamecanbeusedtorename)

• SelectingaresourceinstanceandpressingtheDelete(Mac:Fn+BS)keywillgenerallycloseit

• Youcanalsorightclickandthenselect“Close”

Parameters

• Resourcescanhaveparameterswhichneedtogetspecifiedwhentheresourceiscreated:Initialization(init)Parameters

• Processingresourcescanalsohaveparameterswhichcanbechangedforeachrun:RuntimeParameters

• Init parametersspecifyhowaresourceiscreated,e.g.thelocationofadocumenttoload

• Runtimeparametersconfigurewhataprocessingresourcedoes,e.g.ifsomeprocessingiscase-sensitiveornot.

Loadinganexistingdocument

• GATEcanreadandloaddocumentsinmanyformats:e.g.plaintext,HTML,XML,PDF,Word,CoNLL ,CSV,JSON

• GATEcanloaddocumentsfromfilesandfromURLs• Whenadocumentisloaded,itgetsconvertedtoGATEinternal

formatasdocumenttext+annotations

Loadingadocument

• Toloadadocument:- rightclickonLanguageResources→“New→GATEDocument”OR- Filemenu→ NewLanguageResource→GATEDocument

• UsethesourceURL parametertospecifythedocumenttobeloaded:- typethefilenameorURL,or- clickthefilebrowsericontonavigatetothecorrectdocument

• Loadafilefromyourhands-onmaterials:corpora→news-texts→ft-airlines-27-jul-2001.xml

• Loadawebpage,e.g.http://news.bbc.co.uk

Documentviewer

Documentviewerbuttons

Document

Highlighted tab is the resource currently being viewed

Annotations

• AnnotationsarecentraltoGATE• Annotationsrepresentaspectsofthetextyouwanttoanalyze:

words,sentences,Dates,PersonNames• Annotationsarenamedbytheirtype,e.g.“Person”• Annotationconsistsof

• Annotationtype• startandendoffsets• setoffeatures,eachfeatureisanarbitraryname/valuepair,e.g.

gender=male

AnnotationSets

• Annotationsaregroupedintosets• Eachsetcancontainanynumberofannotationsofanytype• Youcancreateandorganizeyourannotationsetsasyouwish.• Predefinedsets

• Defaultset(emptyname):cannotbedeleted• “Originalmarkups”:annotationsfromthemarkupsinthefile• “Key”:byconvention,usedforgoldstandardannotations

• Clickthe“AnnotationSets”buttoninthedocumentviewerfortheft-airlinesdocumentyouloaded

AnnotationSets

Defaultannotationset

Original markupsannotation set

Annotation types

DocumentViewerButtons

Tabs

Viewingannotations

• ClickingontheAnnotationSetsbuttonopensanewpaneontherighthandsideinsidethedocumentview(AnnotationSetsview)

• Default(unnamed)setcontainssomeexamplesofannotations• ClickontheAnnotationSetname(eg Key)todisplaytheannotation

typesbelongingtothatset• YoushouldseetypessuchasLocation,Date,Personetc.• Clickthecheckboxforanannotationtypetoviewallthe

annotationsofthattypeinthedocument

Acloserlookattheannotations

• ClicktheAnnotationsListbuttonfromthemenuabovetheDisplaypane• Tableshowsannotationtype,annotationset,offsets,annotationid,and

features(forallselectedannotations)• Selectarowinthetabletohighlighttheannotationinthetext• TherearealsootherannotationviewspossiblesuchastheAnnotation

StackandCoreference Editor• TrytheAnnotationStackview

Annotations

Date annotation

Annotations table

Editingexistingannotations• SelectanannotationtypefromtheAnnotationSetsviewandhover

overahighlightedannotationinthetext• Apopupwindowdisplaysmoreinformationaboutit:thisisthe

annotationeditor• Clickthedrawingpinsymbolatthetopoftheeditor.Thiswill“pin” the

windowopen(youcanstillmovethewindowaroundonyourscreenifyouwish)

• Tryeditingtheannotation:youcanchangetheannotationtype,featurenamesandvalues,thespanoftheannotation(clickingleftandrightarrowsatthetopofthebox)ordeletetheannotationoritsfeatures(redXs)

• ClosetheannotationeditorbyclickingtheXinthetoprightcorner,thenviewyoureditedannotationintheAnnotationList

Annotationeditor

annotation editorfeature name value

Annotation type

CreatingaCorpus

• Acorpusisacollectionofdocuments.• Wetendtorunapplicationsonacorpusratherthanonadocument

itself• FirstclosethedocumentsyouhaveloadedinGATE,justsowedon’tget

confused(rightclickandClose)• Nowcreateanewemptycorpus• RightclickonLanguageResources→New→GATECorpus• Youcangivethecorpusaname,orusethedefaultone

PopulatingaCorpus

• Sometimestherecouldbehundredsofdocumentsinacorpus.• Usingthepopulatefunctionmeansyoucanloadlotsofdocumentsinto

thecorpusinonego• RightclickonthenameofyournewcorpusintheResourcespaneand

select“Populate”• Selectthenameofthedirectorywithyourdocuments(hands-

on/corpora/news-texts)

• Allthedocumentswillbeloadedinonego• Viewadocumentorthecorpusbydoubleclickingonit

ProcessingResources

• Processingresources(PRs)arethetoolsthatprocessandannotatetext(textprocessingalgorithms).

• An“application”or“pipeline”consistsofanynumberofPRs,runsequentiallyoveracorpusofdocuments

• ANNIEcontainsPRsfortokenisation,sentencesplitting,POStagging,NamedEntityrecognitionetc.

Applications

Here'sonewemadeearlier:ANNIE

• ANNIEisaready-madecollectionofPRsthatperformsInformationExtractiononunstructuredtext.

• ClicktheiconfromthetopGATEmenuORSelectFile→LoadANNIEsystem

• ViewtheANNIEapplicationbydoubleclickingonit• RunANNIEonyourcorpus(selectthecorpusnameandclick“Run

thisapplication”)

Runninganapplication

PRs selected in application (in order of their execution)

Corpus on which the application is executed

Runtime parameters of the selected PR

Execute the application

Viewingtheresults

• WhenamessageappearsinthebottomleftcornerofyourGATEwindowsayingsomethinglike“ANNIErunin1.3seconds”,theapplicationhasfinished.

• Doubleclickonthedocumenttoviewit• ViewtheannotationsbyselectingAnnotationSetsand

clickingonanyAnnotationtypesintheDefault(unnamed)set

• Ifyouwant,youcanviewtheannotationstableorstackviewtoo.

• Rememberthatnotalltheresultswillbeperfect!

Plugins

• ApluginisacollectionofPRs,andotherresourcesbundledtogether.• EverythingneededforIEinANNIEisintheANNIEplugin.• EverythingneededforIEinFrenchisinthelang_french plugin.

• AnapplicationcanusePRsfromoneormoredifferentplugins.• InordertousePRs,youneedtoloadtherelevantplugin(s)• PluginsareloadedviathePluginManager(greenjigsawpieceicon)

Plugins

• ClicktheicononthetopGATEmenutoopenthePluginManager[orgoviaFile →ManageCREOLEPlugins]

Plugins

List of available pluginsResources in the selected pluginLoad the

plugin for this session only

Load the plugin everytime GATE starts

Apply all the settings

Close the plugins manager

Plugins

• Clickonaplugintosee(ontheRHS)thenamesoftheresourcesitcontains.Havealookatafew.

• NowloadtheToolspluginbycheckingtherelevant“LoadNow” boxforit

• Click“ApplyAll” toloadtheplugin• Click“Close”• RightclickonProcessingResourcestoseewhichnewPRsare

nowavailable

AddinganewPR

• Let'saddaVerbPhraseChunker PRtoANNIE.• First,wehavetoloadthepluginthatcontainsit,andthen

loadthePRintoGATE,beforewecanaddittotheapplication.

• Ifyouwerelookingclosely,you’llhavenoticedthattheToolspluginyoujustloadedcontainstheANNIEVPChunker.

• RightclickonProcessingResourcesandselect“New”→“ANNIEVPChunker”

• Leaveallthedefaultparameterssetandclick“OK”

AddinganewPR(2)

• NowweneedtoaddthenewPRtotheapplication.• DoubleclickonANNIE.• You'llseetheANNIEVPChunker isinthelistofloadedPRs.This

meansit'savailableinGATE,butisn'tyetcontainedintheapplication.

• Addittotheapplicationbyselectingitandusingtherightarrowtotransferit.

• Nowusetheuparrowtomoveittotherightplaceintheapplication.Itshouldgoafter(below)thePOStaggerbutbefore(above)theNEtransducer.

• Runtheapplicationandviewtheresultsonthedocument.• Youshouldseeanewannotationtype“VG”.

IEforotherlanguages

• Youcantryoutotherapplicationsintwoways• Loadapluginthatcontainsaready-madeapplication• ClickApplications->Readymadeapplications• Loadsomedocumentsandruntheapplication• Youcanalsojustloadablankdocumentandtypesometextinit• Ifyoudothis,youneedtorightclickonthedocumentandselect

“Newcorpuswiththisdocument”first• Runthenewapplicationonyourcorpus

NERinFrench

NERinArabic

Hands-onwithTwitIE

l TwitIE isaversionofANNIEthat’sbeenretrainedfortweetsl LoadtheTWITIEplugin(greenjigsawicon)l Nowright-clickon“Applications”,select“Ready-madeapplications”and

“TwitIE”l Createanewcorpus,nameit“Tweets”l Right-clickonthecorpusandselect“populatefromTwitterJSON”,

selectingthefilehands-on-materials/corpora/energy-tweets.jsonl Onceloaded,doubleclickonTwitIE toopenit,andthenselect“Runthis

application”(makesurethetweetscorpusischosen)l Lookatthedifferentannotationsinthedefaultannotationsetl ToseeTokensinhashtags,usetheAnnotationStackview

Analysing tweets

top related