Language resources, standardization and modern trends in NLP

Simon Krek

Jožef Stefan Institute, Artificial Intelligence Laboratory, Slovenia

COST Action

Working Groups / Objectives

• WG1: Integrated interface to European dictionary content

• WG2: Retro-digitized dictionaries

• WG3: Innovative e-dictionaries

• WG4: Lexicography and lexicology from a pan-European perspective

Innovative e-dictionaries

• The third working group will focus on the development of digitally born dictionaries, with attention to the latest developments in e-lexicography and the interface between lexicography and computational linguistics.

• Work will be carried out on:
  • the analysis of the possible impact of automatic acquisition of lexical data
  • the analysis of the interface between dictionaries and computational lexica (cf. wordnets) and syntactically and semantically annotated corpora (cf. FrameNet, SemCor, Senseval)
  • the investigation of the possible use of dictionary content for computational linguistic applications

Electronic lexicography in the 21st century

• The first eLex conference: New challenges, new applications, Louvain-la-Neuve (Belgium), 22 to 24 October 2009

• The second eLex conference: New applications for new users, Bled (Slovenia), 10 to 12 November 2011

• The third eLex conference: Thinking outside the paper, Tallinn (Estonia), 17 to 19 October 2013

• The fourth eLex conference: Linking Lexical Data in the digital age, Herstmonceux Castle (UK), 11 to 13 August 2015

eLex 2011

Language data for digital natives: old wine in a new bottle or...?

Text mining is a challenge

Content is a problem

Presentation is a bigger problem

What is in the middle?

(Web, Mobile) Design

Lexicography

Natural Language Processing

?

Sinclair: Floating dictionary (2001)

• »A few years ago I felt that the time was ripe to plan a new kind of dictionary, one that would never exist on paper, but would be automatic or almost automatic in its selfupdating.

• It would, so to speak, float on top of a corpus, rather like a jellyfish, its tendrils constantly sensing the state of the language.

• As well as reporting on the settled usage and meanings of the words and phrases of a language, like a normal dictionary does, the floating dictionary, when interrogated, dips into the corpus and checks this information, offering instances that match its criteria for the senses; also it explores further to see if there are any instances that conflict with the criteria, and may signify a development of a sense or the emergence of a new usage altogether.

• Within the limits of its powers, it organises this evidence as a comment on the existing dictionary entry.«

Does dictionary content know itself?

• The language technology (LT) community now has a basic idea of how to store various types of information

• so does the Semantic Web (SW) community: RDF, RDFa, RDFS, OWL, SKOS, and more

• standardization in human-oriented dictionary encoding was never really successful (XML, TEI?)

• the question is: if different types of lexicographic information intended for human users will have to know each other, will the format be dictated by LT standards? (Probably yes.)
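
To make the storage question concrete, here is a minimal sketch of how one human-oriented dictionary entry could be expressed as RDF triples using SKOS properties. It is an illustration only, not a proposed standard: the base URI, the entry structure and the function names are assumptions, and only the SKOS and RDF namespace URIs are real.

```python
# Sketch: serialise a dictionary entry as N-Triples using SKOS properties.
# BASE, the entry layout and all function names are hypothetical.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # hypothetical namespace


def literal(text, lang):
    """Return a language-tagged literal with basic escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_ntriples(entry):
    """Turn one entry (a plain dict) into a list of N-Triples lines."""
    subject = f"<{BASE}{entry['id']}>"
    lang = entry["lang"]
    triples = [
        f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .",
        f"{subject} <{SKOS}prefLabel> {literal(entry['headword'], lang)} .",
        f"{subject} <{SKOS}definition> {literal(entry['definition'], lang)} .",
    ]
    for example in entry.get("examples", []):
        triples.append(f"{subject} <{SKOS}example> {literal(example, lang)} .")
    return triples


entry = {
    "id": "jellyfish-n-1",
    "headword": "jellyfish",
    "lang": "en",
    "definition": "a free-swimming marine animal with a soft body and tentacles",
    "examples": ["A jellyfish drifted past the boat."],
}
triples = entry_to_ntriples(entry)
for line in triples:
    print(line)
```

Once definitions, examples and labels are separate, typed statements, other resources can address them individually, which is one reading of dictionary content "knowing itself".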

Similar domain, different task

• EU projects: http://www.xlike.org/, http://xlime.eu/

• The goal of the XLike project is to develop technology to monitor and aggregate knowledge that is currently spread across mainstream and social media, and to enable cross-lingual services for publishers, media monitoring and business intelligence.

• xLiMe proposes to extract knowledge from different media channels and languages and relate it to cross-lingual, cross-media knowledge bases. By doing this in near real-time we will provide a continuously updated and comprehensive view on knowledge diffusion across media.

Services

• Newsfeed
  • a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world
  • http://newsfeed.ijs.si/visual_demo/
  • http://enrycher.ijs.si/

• EventRegistry
  • a system that can analyze news articles and identify world events
  • can identify groups of articles in different languages that describe the same event
  • http://eventregistry.org/
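
The idea of grouping articles in different languages around one event can be sketched as follows. This is not the EventRegistry implementation. It assumes that semantic enrichment has already mapped each article to language-independent concept identifiers, and it uses a simple greedy clustering by Jaccard overlap; the article data, the threshold and the function names are all hypothetical.

```python
# Sketch: greedy cross-lingual grouping of articles by shared concepts.
# Assumes each article already carries language-independent concept IDs.

def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_articles(articles, threshold=0.5):
    """Assign each article to the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"concepts": set, "articles": [ids]}
    for article in articles:
        concepts = set(article["concepts"])
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(concepts, cluster["concepts"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["articles"].append(article["id"])
            best["concepts"] |= concepts
        else:
            clusters.append({"concepts": concepts, "articles": [article["id"]]})
    return [cluster["articles"] for cluster in clusters]


articles = [
    {"id": "en-1", "lang": "en", "concepts": ["Tallinn", "Conference", "Lexicography"]},
    {"id": "sl-1", "lang": "sl", "concepts": ["Tallinn", "Conference", "Lexicography", "Estonia"]},
    {"id": "de-1", "lang": "de", "concepts": ["Football", "Berlin"]},
    {"id": "et-1", "lang": "et", "concepts": ["Tallinn", "Lexicography", "Estonia"]},
]
groups = group_articles(articles)
print(groups)
```

Because similarity is computed over concepts rather than words, the language of each article plays no role in the grouping itself.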

EventRegistry system architecture

ENeL perspective

• Complex story about events = complex story about words/languages

Slovene Estonian English German French Hungarian Croatian Basque Swedish …

Cross-lingual horizontal axis

Diachronic vertical axis

2015 1950 1900 1850 1800 …

Cross-lingual synchronic horizontal axis

• "Never without data"
  • Existing lexical resources (dictionaries, BabelNet, AnyNet, Linked Data, etc.)
  • Corpora, the Web and NLP

• Definition extraction (and generation)
  • RANLP 2009, International workshop on definition extraction
  • Language Technology for eLearning (http://www.lt4el.eu/)

• Extraction of grammatical or lexical information
  • Kookkurrenzdatenbank (http://corpora.ids-mannheim.de/ccdb/)
  • Sketch Engine (http://www.sketchengine.co.uk/)

• Extraction of good (dictionary) examples
  • ENeL Vienna workshop

• Extraction of translation equivalents
  • Linguee etc.

• Extraction of Multi-word Expressions (Parseme)
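
As an illustration of extracting lexical information from a corpus, the sketch below ranks the collocates of a headword with the logDice association score, 14 + log2(2·f_xy / (f_x + f_y)). The slides do not name a measure, so the choice of logDice is an assumption, and the input pairs (headword, collocate in one grammatical relation) are invented toy data.

```python
# Sketch: rank collocates of a headword by logDice over toy (head, collocate) pairs.
import math
from collections import Counter


def log_dice(f_xy, f_x, f_y):
    """logDice score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_collocates(pairs, headword, min_freq=2):
    """Return (collocate, score) for one headword, best first."""
    pair_freq = Counter(pairs)
    head_freq = Counter(head for head, _ in pairs)
    coll_freq = Counter(coll for _, coll in pairs)
    scored = []
    for (head, coll), f_xy in pair_freq.items():
        # Skip other headwords and pairs too rare to be reliable.
        if head == headword and f_xy >= min_freq:
            score = log_dice(f_xy, head_freq[head], coll_freq[coll])
            scored.append((coll, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical verb-object pairs, as if taken from a parsed corpus.
pairs = (
    [("drink", "coffee")] * 3
    + [("drink", "water")] * 2
    + [("pour", "water")] * 4
    + [("drink", "it")] * 1
    + [("see", "it")] * 5
)
ranked = rank_collocates(pairs, "drink")
for collocate, score in ranked:
    print(f"{collocate}\t{score:.2f}")
```

Here "coffee" outranks "water" because "water" also occurs frequently with another verb, and "it" is dropped by the minimum frequency filter.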

Automatically Constructed Dictionary Content

Complex multimodal information extraction

• Explain, combine, exemplify
  • Definitions
    • Found
    • Generated
  • Combinations
    • Collocations (as subject, as object)
    • Multi-word expressions
  • Knowledge-Rich Contexts

• Real-time data
  • Streaming
    • Twitter
    • News Feeds

• Sounds, graphics and visuals
  • Sounds
    • Speech Synthesis
    • Recorded / Speech Recognition
  • Graphics
    • Images
    • Videos

• Multi-lingual, cross-lingual
  • (Hidden) parallel corpora
  • hub language

ENeL

• WG1: Integrated interface to European dictionary content

• WG2: Retro-digitized dictionaries

• WG3: Innovative e-dictionaries

• WG4: Lexicography and lexicology from a pan-European perspective

Retro-digitization

• Digital Agenda for Europe (Europe 2020 Strategy: one of the pillars)

• Commission’s Recommendation on the digitization and online accessibility of cultural material and digital preservation

• Put in place solid plans for their investments in digitization and foster public-private partnerships to share the gigantic cost of digitization (recently estimated at € 100 billion).

• Make 30 million objects available through Europeana by 2015, including all Europe's masterpieces which are no longer protected by copyright, and all material digitized with public funding.

Retro-digitized dictionaries

• encode and enrich dictionary data (standards and tools)
  • (the question is: if different types of lexicographic information intended for human users will have to know each other, will the format be dictated by LT standards?)
  • definitions
  • examples
  • etymology
  • other types of information

• linking dictionary data with historical corpora
  • http://nl.ijs.si/imp/
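
A minimal sketch of what linking dictionary data with a historical corpus could involve: historical spellings are normalised to modern headwords, and each headword collects its dated attestations. The normalisation table, the toy corpus (in English for readability) and the function names are all hypothetical and do not describe the resources at the URL above.

```python
# Sketch: link dictionary headwords to dated attestations in a historical corpus.
from collections import defaultdict

# Hypothetical normalisation table: historical spelling -> modern headword.
MODERNISED = {"musick": "music", "musicke": "music", "shew": "show"}


def link_attestations(corpus, headwords):
    """Map each headword to its (year, sentence) attestations, oldest first."""
    links = defaultdict(list)
    for year, sentence in corpus:
        for token in sentence.lower().split():
            word = token.strip(".,;:!?\"'")
            # Fall back to the word itself when no historical variant is known.
            lemma = MODERNISED.get(word, word)
            if lemma in headwords:
                links[lemma].append((year, sentence))
    return {headword: sorted(hits) for headword, hits in links.items()}


corpus = [
    (1850, "They shew great love of music."),
    (1800, "The musick was heard in the hall."),
]
links = link_attestations(corpus, {"music", "show"})
for headword, hits in sorted(links.items()):
    print(headword, [year for year, _ in hits])
```

Sorting attestations by year gives each entry a simple timeline, which matches the diachronic vertical axis sketched earlier.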

Lexical Cloud

Integrated interface to European (dictionary / lexical) content

Any dictionary

Anypedia

AnyNet

Any corpus

Any base

Conclusion

• any word/concept in any language on any device offers a story about its current life and its history
  • what is a "concept" (in the sense of "event")? X-Nets? Wikipedia?
  • what is the central format?

• what is the appropriate context?
  • EU projects? ICT? Cultural Heritage?
  • Infrastructure (e.g. Clarin)?
