WebLicht Application and “Workspaces”
Post on 22-Feb-2016
WebLicht Application and Workspaces
Munich, September 2010
www.d-spin.org
WebLicht Application and “Workspaces”
Erhard Hinrichs & Thomas Zastrow
University of Tübingen
Outline
Web-based Linguistic Chaining Tool (WebLicht) for incremental filtering and access of language corpus data
• WebLicht - Motivation
• WebLicht - Architecture
• WebLicht - Future Requirements
• Test Case - Gutenberg Corpus
CLARIN Mission
CLARIN (Common Language Resource and Technology Infrastructure Network)
• is committed to establishing an integrated and interoperable research infrastructure (RI) supporting easy access to and use of language resources and technology
• aims to overcome the current fragmentation and offer a stable, persistent and extendable infrastructure
• will offer its services to researchers and scholars across a wide spectrum of domains, in particular in the humanities and social sciences
• ESFRI roadmap project; implementation phase starts in 2011
Typical CLARIN user scenario
Scenario: A PhD student investigates regional differences in vocabulary and in word collocations in different variants of German.
Data: large text corpora available at BBAW in Berlin, at the Austrian Academy of Science in Vienna, the Swiss Text Corpus Project in Basel, and at EURAC, Bolzano.
Tools for targeted data access: WebLicht offers customizable chains of web services for filtering and analyzing the data
WebLicht - Motivation
• Many linguistic resources (corpora, dictionaries, …) and tools (tokenizer, tagger, parser, …) are available
• Most of them are implemented to run on local machines. This can be inconvenient and error-prone
• Requirements: go beyond “do-it-yourself” and “download-first” strategies
• The CLARIN solution: make tools and resources available as web services
WebLicht - Architecture
WebLicht is a service-oriented architecture (SOA) for accessing and processing text corpora.
Development started in October 2008.
WebLicht consists of the following components:
• Distributed services: offer functionality (resources and tools) over the internet, implemented as web services (ca. 90 at the moment)
• Repository: stores metadata and technical information about the services
• Web 2.0-based user interface: interacts with the user and combines services with information from the repository. Access is still possible via scripts or programming code
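To make the role of the repository more concrete, the following is a minimal sketch of how service metadata could drive the choice of the next step in a chain. The record fields (`requires`, `produces`), the layer names and the example URLs are assumptions made for illustration; the slides do not describe the actual repository schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEntry:
    """Hypothetical repository record describing one web service."""
    name: str
    url: str
    requires: frozenset  # annotation layers the service needs as input
    produces: frozenset  # annotation layers the service adds


# Illustrative entries only; names, URLs and layers are invented.
REGISTRY = [
    ServiceEntry("tokenizer", "https://example.org/tokenizer",
                 frozenset({"text"}), frozenset({"tokens"})),
    ServiceEntry("tagger", "https://example.org/tagger",
                 frozenset({"tokens"}), frozenset({"pos"})),
    ServiceEntry("parser", "https://example.org/parser",
                 frozenset({"tokens", "pos"}), frozenset({"parse"})),
]


def applicable(entries, available):
    """Return services whose inputs are present and whose output is still missing."""
    return [
        entry for entry in entries
        if entry.requires <= available and not entry.produces <= available
    ]


next_steps = [entry.name for entry in applicable(REGISTRY, frozenset({"text"}))]
```

With only raw text available, this sketch offers the tokenizer as the sole next step; once tokens exist, the tagger becomes selectable, and so on.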
WebLicht - Architecture
[Figure: architecture diagram. Labels: Web 2.0 application for tool chaining and execution; repository; standard-conformant text corpus encoding; participating sites: Stuttgart, Tübingen, Berlin, Leipzig, Finland, Romania, Iceland, UK]
WebLicht – Architecture
• Services are implemented as REST-style web services
• The HTTP POST method is used to send data from the UI to the services
• Anything that can use the HTTP protocol can act as a client: browsers, command-line tools (wget, curl), programming languages
• Anyone can implement their own interface to WebLicht
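Because the services only expect an HTTP POST, a client can be very small. The sketch below uses only the Python standard library; the endpoint URL and the `text/plain` media type are placeholders, since real service addresses and accepted formats would come from the WebLicht repository.

```python
import urllib.request

# Hypothetical endpoint; not an actual WebLicht service address.
SERVICE_URL = "https://example.org/weblicht/tokenizer"


def build_request(text: str, url: str = SERVICE_URL) -> urllib.request.Request:
    """Prepare an HTTP POST request carrying the input text as its body."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},  # assumed media type
        method="POST",
    )


def call_service(text: str, url: str = SERVICE_URL) -> bytes:
    """Send the text to the service and return the raw response body."""
    with urllib.request.urlopen(build_request(text, url)) as response:
        return response.read()
```

Calling `call_service("Das ist ein Test.")` against a real service URL would return that service's output as bytes, which can then be posted to the next service in the chain.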
WebLicht - Processing Chains
WebLicht - Results
WebLicht - Results
WebLicht - Features
• With REST-style web services, everyone can implement a web service for WebLicht (4-page tutorial)
• The SOA infrastructure is independent of programming languages and operating systems
• The chaining algorithm is independent of the data format used
• From a legal point of view, the web services remain located at the institute where they were created
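One way to read the claim that chaining is independent of the data format is that the chain only forwards an opaque payload from one service to the next. The sketch below illustrates that idea with local stand-in functions; in WebLicht each step would be an HTTP POST to a remote service, and the two example "services" here are invented for the demonstration.

```python
from typing import Callable, Iterable

# A service is modelled as "bytes in, bytes out".
Service = Callable[[bytes], bytes]


def run_chain(data: bytes, services: Iterable[Service]) -> bytes:
    """Pass the payload through each service in order, never inspecting it."""
    for service in services:
        data = service(data)
    return data


# Stand-in services for illustration only (not real WebLicht tools).
def to_lower(payload: bytes) -> bytes:
    return payload.lower()


def one_token_per_line(payload: bytes) -> bytes:
    return b"\n".join(payload.split())


result = run_chain(b"Das ist ein Test", [to_lower, one_token_per_line])
```

Only the individual services need to understand the payload; `run_chain` itself would work unchanged for any encoding the services agree on.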
WebLicht – Future Requirements
• Web services are synchronous, but some linguistic annotation processes are very time-consuming. Asynchronous behavior of these services would be desirable
• Processing power is limited by local computing resources. Scalability is only possible with strong centers
• The current architecture is not sufficiently parallelized and therefore does not scale up. Needed: accommodating a large number of simultaneous users and parallelization of processes
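Until services themselves become asynchronous, a client can at least avoid waiting on each synchronous call in turn. The sketch below is a client-side workaround under that assumption, not the server-side asynchronous behavior asked for above; `annotate` is a stand-in for a slow, blocking service call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def annotate(text: str) -> str:
    """Stand-in for a slow, blocking annotation service call."""
    time.sleep(0.05)  # simulates network and processing latency
    return text.upper()


def annotate_all(texts, max_workers=4):
    """Submit all texts at once and collect the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(annotate, text) for text in texts]
        # The caller is free to do other work here while the calls run.
        return [future.result() for future in futures]


results = annotate_all(["ein text", "noch ein text"])
```

This overlaps the waiting time of independent requests, but it does not add processing capacity on the server side, which is the scalability limit named above.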
WebLicht – Future Requirements
• Currently, users have to store the input data and their results on their local machines. Needed: online storage in the form of personal workspaces with reliable backup solutions
• Linguistic tools are typically developed in a variety of heterogeneous software environments and programming languages (Java, Perl, Python, C/C++, Prolog, Lisp, …). Needed: encapsulation of individual services with common APIs for interoperability
• Currently, WebLicht services are limited to processing text corpora. Needed: extending web services to spoken language and multi-modal datasets (MPI is already working on this)
Test Case: Gutenberg Corpus
On the basis of this structure, a part of the freely available Gutenberg Project was annotated in Tübingen:
• Ca. 20,000 texts from 800 authors
• Runtime: ca. 3.5 weeks
• Result: 217 million tokens (words), 533 million constituents, 110 GB of data
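The reported figures allow a rough back-of-the-envelope estimate of throughput and data volume. The calculation below assumes that the 3.5 weeks were continuous wall-clock time and that GB means 10^9 bytes; both are assumptions, and all inputs are approximate ("ca.") values.

```python
TEXTS = 20_000
TOKENS = 217_000_000
CONSTITUENTS = 533_000_000
DATA_BYTES = 110 * 10**9                 # assuming decimal gigabytes
RUNTIME_SECONDS = 3.5 * 7 * 24 * 3600    # assuming uninterrupted processing

tokens_per_second = TOKENS / RUNTIME_SECONDS
tokens_per_text = TOKENS / TEXTS
bytes_per_token = DATA_BYTES / TOKENS
constituents_per_token = CONSTITUENTS / TOKENS
```

Under these assumptions the run averaged roughly 100 tokens per second, and the annotated output takes about 500 bytes per token.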
Gutenberg Corpus – Analyzing
• Full-text index (Lucene)
• Database for the linear part of the data
• Tree-like structures can be analyzed with XML-based techniques (XPath, XQuery)
• DOM-based techniques are slow and resource-hungry
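The contrast between querying a tree and avoiding a full in-memory DOM can be illustrated with Python's standard library. The XML below is a hypothetical, simplified constituent tree, not the actual encoding used for the Gutenberg annotation; `xml.etree.ElementTree` supports only a subset of XPath, which is sufficient here.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; element and attribute names are invented.
SAMPLE = b"""<sentence>
  <constituent cat="S">
    <constituent cat="NP"><token>Das</token><token>Haus</token></constituent>
    <constituent cat="VP">
      <token>hat</token>
      <constituent cat="NP"><token>ein</token><token>Dach</token></constituent>
    </constituent>
  </constituent>
</sentence>"""


def noun_phrases_xpath(xml_bytes):
    """Load the whole tree, then select NP constituents with an XPath expression."""
    root = ET.fromstring(xml_bytes)
    return [
        [token.text for token in phrase.iter("token")]
        for phrase in root.findall(".//constituent[@cat='NP']")
    ]


def count_tokens_streaming(xml_bytes):
    """Count tokens while parsing, discarding each element once it is complete."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "token":
            count += 1
        elem.clear()  # drop the element's content so memory use stays low
    return count
```

The first function is convenient for structural queries on a single sentence or document; the second shows the streaming style that becomes attractive when files are too large to hold as a full tree.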
Links etc.
CLARIN homepage: http://www.clarin.eu
D-Spin homepage: http://www.d-spin.org
WebLicht (login via DFN AAI): https://weblicht.sfs.uni-tuebingen.de/
Erhard Hinrichs, Thomas Zastrow
Seminar für Sprachwissenschaft
Universität Tübingen
Wilhelmstr. 19
D-72074 Tübingen
thomas.zastrow@uni-tuebingen.de
Erhard.hinrichs@uni-tuebingen.de
WebLicht - Combinations