www.clarin-d.net
WebLicht Tool Integration
Erhard Hinrichs, Marie Hinrichs, Wei Qiu, University of Tübingen
Introduction
• WebLicht (Web-based Linguistic Chaining Tool)
  – Motivation
  – Overview of the architecture
  – TCF (Text Corpus Format) data-exchange format
• Integration
  – Repository Sets
  – Comet (CMDI orchestration metadata tool)
  – Center Registry Update
• Tool Testing
  – Bombard (tool scalability testing)
  – Awesome (verify actual tool input/output against metadata)
• Usage Statistics
  – Piwik
• Links
WebLicht Tool Integration, 10.11.2016, Ljubljana, Slovenia
Motivation
Many natural language processing tools must be invoked from the command line or from programming code.
• Sometimes difficult to install
• Not always available for all operating systems
• Confusing for those not accustomed to running command-line tools or who don’t like to program
• Not possible to create some annotations with one tool and other annotations with a different tool
  – Input/output formats differ
Overview
WebLicht
• Makes NLP tools available online and enables users to mix and match tools to suit their needs.
• The WebLicht web interface guides the construction and execution of processing chains, and provides visualization of results.
• NLP tools are made available by CLARIN centers as web services.
Overview
WebLicht is an execution environment for natural language processing pipelines:
● Uses a service-oriented architecture (SOA)
● Web services are combined to form a chain
● Chains are executed via sequential HTTP POST requests to services on the chain
● The output of service n is the input to service n+1 in the chain
● Most services use Text Corpus Format (TCF) as their input and output
● Services add one or more annotation layer(s)
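The sequential execution described above can be sketched as a simple loop. This is a minimal illustration, not WebLicht's chaining engine: `run_chain`, the `Post` signature, and the stand-in `fake_post` are hypothetical names, and the HTTP transport is injected as a function so the sketch runs without any network access.

```python
from typing import Callable, Sequence

# Hypothetical transport signature: (service_url, request_body) -> response_body.
Post = Callable[[str, bytes], bytes]


def run_chain(document: bytes, service_urls: Sequence[str], post: Post) -> bytes:
    """Send the document through each service in order."""
    data = document
    for url in service_urls:
        # The output of service n becomes the input of service n+1.
        data = post(url, data)
    return data


def fake_post(url: str, body: bytes) -> bytes:
    """Stand-in for a real HTTP POST: appends a marker naming the service."""
    return body + b"|" + url.encode("utf-8")


result = run_chain(b"text", ["tokenizer", "tagger"], fake_post)
```

In a real setting `post` would wrap an HTTP client call and each response would be a TCF document with one or more layers added.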
Main Components
• Services for data processing, distributed
• Repositories containing metadata about the services
• Harvester
  – Collects service metadata from repositories
• Chaining engine
  – Guides the construction of valid chains
  – Executes chains
• Web interface for building and executing chains
• WaaS (WebLicht as a Service)
  – RESTful web service to execute chains
  – Can be called from the command line, scripts, etc.
Component Interactions
TCF: Motivation
TCF (Text Corpus Format) is WebLicht’s data-exchange format
• Service chains are meant to be built flexibly
  – Mix-n-match, Lego style
  – Not “one service does it all”
• Output of service n is input of service n+1
• Services rely on the annotations in the input
• Services must be able to
  – Read annotations produced by previous services
  – Write annotations for use by later services
TCF: Details
• Each service adds one or more annotation layers
• Services are not permitted to change or remove existing annotations
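A minimal sketch of the add-only rule, assuming a toy sentence splitter that ends a sentence at every "." token. The function name `add_sentences_layer` and the sample text are illustrative only; the element and attribute names (`tokens`, `token`, `sentences`, `sentence`, `ID`, `tokenIDs`) and the namespace follow the TCF example in these slides.

```python
import copy
import xml.etree.ElementTree as ET

TC_NS = "http://www.dspin.de/data/textcorpus"


def add_sentences_layer(corpus: ET.Element) -> ET.Element:
    """Return a copy of a TextCorpus element with a sentences layer appended."""
    result = copy.deepcopy(corpus)  # existing layers are carried over unchanged
    tokens = list(result.iter(f"{{{TC_NS}}}token"))
    layer = ET.SubElement(result, f"{{{TC_NS}}}sentences")
    current, count = [], 0
    for token in tokens:
        current.append(token.get("ID"))
        if token.text == ".":  # toy rule: a full stop closes the sentence
            ET.SubElement(layer, f"{{{TC_NS}}}sentence",
                          ID=f"s_{count}", tokenIDs=" ".join(current))
            current, count = [], count + 1
    if current:  # trailing tokens without a closing full stop
        ET.SubElement(layer, f"{{{TC_NS}}}sentence",
                      ID=f"s_{count}", tokenIDs=" ".join(current))
    return result


corpus = ET.fromstring(
    f'<TextCorpus xmlns="{TC_NS}" lang="de">'
    "<text>Karin fliegt. Sie bleibt.</text>"
    '<tokens><token ID="t_0">Karin</token><token ID="t_1">fliegt</token>'
    '<token ID="t_2">.</token><token ID="t_3">Sie</token>'
    '<token ID="t_4">bleibt</token><token ID="t_5">.</token></tokens>'
    "</TextCorpus>"
)
annotated = add_sentences_layer(corpus)
sentences = [s.get("tokenIDs") for s in annotated.iter(f"{{{TC_NS}}}sentence")]
```

The service only appends a new child element; the `text` and `tokens` layers it received are returned exactly as they arrived.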
TCF: Details
TCF is an XML format for storing linguistic annotations
• Text, tokens, sentences, lemmas, part-of-speech, morphology, parsing, etc.
• Standoff format
  – Each type of annotation is stored in a separate XML element
  – The token layer can be seen as the central, atomic element to which other annotation layers refer
TCF: Example
<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <MetaData xmlns="http://www.dspin.de/data/metadata">
    <!-- The execution engine records metadata about the services used to create the document here. -->
  </MetaData>
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>Karin fliegt nach New York. Sie will dort Urlaub machen.</text>
TCF: Example
    <tokens>
      <token ID="t_0">Karin</token>
      <token ID="t_1">fliegt</token>
      <token ID="t_2">nach</token>
      <token ID="t_3">New</token>
      <token ID="t_4">York</token>
      <token ID="t_5">.</token>
      <token ID="t_6">Sie</token>
      <token ID="t_7">will</token>
      <token ID="t_8">dort</token>
      <token ID="t_9">Urlaub</token>
      <token ID="t_10">machen</token>
      <token ID="t_11">.</token>
    </tokens>
• Each token has a unique ID
• All other annotation layers (except the text layer) reference tokens directly or indirectly
TCF: Example
    <sentences>
      <sentence ID="s_0" tokenIDs="t_0 t_1 t_2 t_3 t_4 t_5"></sentence>
      <sentence ID="s_1" tokenIDs="t_6 t_7 t_8 t_9 t_10 t_11"></sentence>
    </sentences>
TCF: Example
    <POStags tagset="stts">
      <tag ID="pt_0" tokenIDs="t_0">NE</tag>
      <tag ID="pt_1" tokenIDs="t_1">VVFIN</tag>
      <tag ID="pt_2" tokenIDs="t_2">APPR</tag>
      <tag ID="pt_3" tokenIDs="t_3">NE</tag>
      <tag ID="pt_4" tokenIDs="t_4">NE</tag>
      <tag ID="pt_5" tokenIDs="t_5">$.</tag>
      <tag ID="pt_6" tokenIDs="t_6">PPER</tag>
      <tag ID="pt_7" tokenIDs="t_7">VMFIN</tag>
      <tag ID="pt_8" tokenIDs="t_8">ADV</tag>
      <tag ID="pt_9" tokenIDs="t_9">NN</tag>
      <tag ID="pt_10" tokenIDs="t_10">VVINF</tag>
      <tag ID="pt_11" tokenIDs="t_11">$.</tag>
    </POStags>
    ...
  </TextCorpus>
</D-Spin>
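To show how the standoff references resolve, here is a short sketch that reads a shortened version of the example (first sentence only, XML declaration and MetaData omitted) with Python's standard `xml.etree.ElementTree`. It is an illustration only; the slides point to a Java library for real TCF processing, and the variable names here are arbitrary.

```python
import xml.etree.ElementTree as ET

NS = {"tc": "http://www.dspin.de/data/textcorpus"}

TCF = """<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>Karin fliegt nach New York.</text>
    <tokens>
      <token ID="t_0">Karin</token>
      <token ID="t_1">fliegt</token>
      <token ID="t_2">nach</token>
      <token ID="t_3">New</token>
      <token ID="t_4">York</token>
      <token ID="t_5">.</token>
    </tokens>
    <POStags tagset="stts">
      <tag ID="pt_0" tokenIDs="t_0">NE</tag>
      <tag ID="pt_1" tokenIDs="t_1">VVFIN</tag>
      <tag ID="pt_2" tokenIDs="t_2">APPR</tag>
      <tag ID="pt_3" tokenIDs="t_3">NE</tag>
      <tag ID="pt_4" tokenIDs="t_4">NE</tag>
      <tag ID="pt_5" tokenIDs="t_5">$.</tag>
    </POStags>
  </TextCorpus>
</D-Spin>"""

root = ET.fromstring(TCF)
corpus = root.find("tc:TextCorpus", NS)

# The token layer is the anchor: map each token ID to its surface form.
tokens = {t.get("ID"): t.text for t in corpus.iterfind("tc:tokens/tc:token", NS)}

# Every POS tag points back at one or more tokens through tokenIDs.
tagged = [
    (" ".join(tokens[i] for i in tag.get("tokenIDs").split()), tag.text)
    for tag in corpus.iterfind("tc:POStags/tc:tag", NS)
]
```

The POS layer never repeats the token text; it only stores IDs, which is what lets later services add layers without touching earlier ones.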
TCF: Processing
• More detailed information about TCF is available in the Developer's Manual on the WebLicht wiki:
  – Schema
  – Online validator
  – Tutorial for reading/writing TCF using the Java library
Building a Tool
To-dos for implementing a WebLicht tool:
• Gather resources needed for the tool
  – Models, lists, etc.
• Implement the tool as a web service
• Deploy the web service on a public server
Integrating a Tool into WebLicht
Making a tool visible to WebLicht:
• Create a set in your center’s repository for WebLicht web services
• Create CMDI metadata for the tool and place it in the repository set
• Assign a PID to the tool
• Update your center’s data in the Center Registry to enable harvesting by WebLicht
Repository Set
• Each center offering WebLicht services must create a set in their repository for metadata about those web services
  – Metadata for WebLicht services should be part of this set
  – The WebLicht harvester only retrieves those sets
  – Avoids harvesting of unnecessary data
• Procedure for creating a set depends on the repository software (Fedora, DSpace, …)
• Set name can be anything, but keep it simple (no spaces or special characters)
  – e.g. WebLichtWebServices
Creating Tool Metadata with Comet
Comet (CMD Orchestration Metadata Editing Tool) can be used to create CMDI metadata using the WebLichtWebService profile
• Two main sections
  – General Information
    • Tool name, description, PID, …
  – "Orchestration" Information
    • Tool language, input/output specification
• Needed by the chaining and execution engine to
  – Guide users in building valid chains
  – Execute a processing chain
Creating Tool Metadata with Comet
Comet can be used to create metadata for a WebLicht web service:
• Create from scratch
• Upload metadata for editing
• Use existing metadata as a template
Creating Tool Metadata with Comet
General Information
Creating Tool Metadata with Comet
Orchestration Information
Input (TCF):
• tokens
• lang (de|en|fr|it)
Output (adds to input):
• lemmas
• part-of-speech tags
  – tagset depends on input language
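The orchestration information above is what makes chain validation possible: a tool fits if the document language is supported and every required layer is already present. The sketch below is a simplified, hypothetical model (`ToolSpec`, `can_append`, and `validate_chain` are not WebLicht names, and the real metadata carries more detail, such as tagsets).

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Set


@dataclass(frozen=True)
class ToolSpec:
    """Simplified stand-in for a tool's orchestration metadata."""
    name: str
    languages: frozenset
    requires: frozenset
    produces: frozenset


def can_append(tool: ToolSpec, lang: str, available: Set[str]) -> bool:
    """A tool fits if it supports the language and all required layers exist."""
    return lang in tool.languages and tool.requires <= available


def validate_chain(chain: Sequence[ToolSpec], lang: str,
                   initial_layers: Iterable[str]) -> Set[str]:
    """Return the layers present after the chain, or raise at the first misfit."""
    layers = set(initial_layers)
    for tool in chain:
        if not can_append(tool, lang, layers):
            raise ValueError(f"{tool.name} cannot run here")
        layers |= tool.produces  # services only add layers
    return layers


langs = frozenset({"de", "en", "fr", "it"})
tokenizer = ToolSpec("tokenizer", langs, frozenset({"text"}), frozenset({"tokens"}))
tagger = ToolSpec("tagger", langs, frozenset({"tokens"}),
                  frozenset({"lemmas", "POStags"}))

final_layers = validate_chain([tokenizer, tagger], "de", {"text"})
```

Placing the tagger first would fail, because plain text carries no `tokens` layer yet.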
Update Center Registry Data
Ask your Center Registry contact person to add these elements to your center’s data:
• WebServicesSet
  – The name of the set in your repository containing CMDI metadata for WebLicht web services
  – No spaces or special characters allowed
  – e.g. WebLichtWebServices
• WebServiceType: WebLicht
• MetadataScheme: CMDI
• OaiAccessPoint
  – URL for harvesting
  – e.g. http://repository.my.org.countrycode/oaiprovider
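As a rough illustration of how these registry values could combine into a harvesting request, the sketch below builds a standard OAI-PMH `ListRecords` URL. Several points are assumptions: that the access point speaks OAI-PMH (suggested by the element name), that the metadata prefix is `cmdi`, that "no special characters" means ASCII letters, digits, underscore, and hyphen, and the helper name `list_records_url` itself. The actual request issued by the WebLicht harvester is not described in these slides.

```python
import re
from urllib.parse import urlencode

# Assumption: a "simple" set name uses only ASCII letters, digits, "_" and "-".
SET_NAME = re.compile(r"[A-Za-z0-9_-]+")


def list_records_url(oai_access_point: str, set_name: str,
                     metadata_prefix: str = "cmdi") -> str:
    """Build an OAI-PMH ListRecords URL restricted to one set."""
    if not SET_NAME.fullmatch(set_name):
        raise ValueError(f"set name is not simple: {set_name!r}")
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,  # assumed prefix, check your repository
        "set": set_name,
    })
    return f"{oai_access_point}?{query}"


url = list_records_url("http://repository.my.org.countrycode/oaiprovider",
                       "WebLichtWebServices")
```

Restricting the request to one set is what keeps the harvester from pulling unrelated records out of the repository.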
Tool Testing
Some useful testing software for tool developers:
• Bombard for scalability testing
• Awesome for testing metadata correctness
Tool Testing: Bombard
Bombard for scalability testing
• Simulates users invoking tool chains
• Flexible test-case configuration
• Multiple test cases can be run simultaneously during one “bombardment”
• Reports statistics for each tool:
  – successes/failures
  – average processing times
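The general idea of such a load test can be sketched with the Python standard library. This is not Bombard's implementation or interface: `bombard`, its parameters, and the report keys are hypothetical, and `invoke` stands in for whatever call sends one request to the tool under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict


def bombard(invoke: Callable[[], object], n_users: int,
            requests_per_user: int) -> Dict[str, float]:
    """Run simulated users concurrently and summarise the outcomes."""

    def one_user(_user_index: int):
        results = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                invoke()
                ok = True
            except Exception:  # any error counts as a failed request
                ok = False
            results.append((ok, time.perf_counter() - start))
        return results

    # One worker thread per simulated user, all running at the same time.
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        per_user = list(pool.map(one_user, range(n_users)))

    flat = [r for user in per_user for r in user]
    successes = sum(1 for ok, _ in flat if ok)
    return {
        "successes": successes,
        "failures": len(flat) - successes,
        "avg_seconds": mean(t for _, t in flat),
    }


# A sleep stands in for a real service call.
stats = bombard(lambda: time.sleep(0.01), n_users=4, requests_per_user=5)
```

Raising `n_users` while watching `avg_seconds` and `failures` gives a first impression of how a service behaves under concurrent load.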
Improving Scalability
One way to improve the scalability of web services:
• Jesque - a distributed task queue framework
• Exploit the parallel processing capabilities of modern servers and computing clusters
  – parallel processing within requests
  – concurrent processing of requests
  – guarantees with respect to usage of resources
  – fairness (small requests should not have to wait for large ones)
Tool Testing: Awesome
Awesome (Automated WEb Service Orchestration Monitoring Environment)
• Tests that services’ orchestration metadata matches actual usage
• POSTs a small test file containing the required input annotations to the service
• Checks that the output returned contains the expected additional annotation layers
• Runs tests automatically at regular time intervals
  – Reports grouped by creator, error code, or severity
• Can test individual services
  – Specify PID, input file, expected output file
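The core comparison in such a test can be sketched as follows: list the annotation layers in the input and in the returned output, then report anything that was removed or that the metadata promised but the service did not deliver. This is a simplified illustration, not Awesome's code; `layer_names` and `check_output` are hypothetical helpers, and the check looks only at layer names, not at layer contents.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Set

TC_NS = "http://www.dspin.de/data/textcorpus"


def layer_names(tcf_xml: str) -> Set[str]:
    """Names of the annotation layers directly under TextCorpus."""
    root = ET.fromstring(tcf_xml)
    corpus = root.find(f"{{{TC_NS}}}TextCorpus")
    return {child.tag.split("}", 1)[-1] for child in corpus}


def check_output(input_xml: str, output_xml: str,
                 expected_new_layers: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the output looks right."""
    before, after = layer_names(input_xml), layer_names(output_xml)
    problems = [f"input layer removed: {name}" for name in sorted(before - after)]
    problems += [f"expected layer missing: {name}"
                 for name in sorted(set(expected_new_layers) - after)]
    return problems


def make_doc(layers: str) -> str:
    """Wrap layer markup in a minimal D-Spin document (test data only)."""
    return (f'<D-Spin xmlns="http://www.dspin.de/data" version="0.4">'
            f'<TextCorpus xmlns="{TC_NS}" lang="de">{layers}</TextCorpus></D-Spin>')


test_input = make_doc("<text>Hallo</text><tokens/>")
good_output = make_doc("<text>Hallo</text><tokens/><lemmas/><POStags/>")
bad_output = make_doc("<text>Hallo</text><lemmas/>")

ok_report = check_output(test_input, good_output, ["lemmas", "POStags"])
bad_report = check_output(test_input, bad_output, ["lemmas", "POStags"])
```

Here `bad_report` flags both the dropped `tokens` layer and the missing `POStags` layer, which are the two kinds of mismatch between metadata and actual behaviour that matter for chaining.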
Tool Testing: Awesome
Statistics with Piwik
• WebLicht uses the CLARIN installation of the analytics platform Piwik to gather user statistics
• Each web service invocation from WebLicht or WaaS is recorded
Current Tools
• Languages:
  – German, English most represented
  – Some tools for French, Italian, Spanish, …
  – Need support for more languages
• Tools:
  – Tokenizers / sentence splitters
  – POS taggers
  – Lemmatizers
  – Morphological analyzers
  – Parsers
WebLicht: Login
http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/
Links
• Developer Manual
  – Tutorial for creating WebLicht web services
  – TCF in detail
• Comet: for creating tool metadata
• Bombard: wlsupport[at]sfs.uni-tuebingen.de
• Awesome: tests web services against metadata
• Harvester: list of harvested WebLicht services
• Jesque: to improve scalability
• WaaS: WebLicht as a Service
Thank You!