META-NET has received funding from the EU’s Horizon 2020 research and innovation programme through the contract CRACKER (grant agreement no.: 645357). Formerly co-funded by FP7 and ICT PSP through the contracts T4ME (grant agreement no.: 249119), CESAR (grant agreement no.: 271022), METANET4U (grant agreement no.: 270893) and META-NORD (grant agreement no.: 270899).
Language Resources for Multilingual Europe
Georg Rehm
META-NET Network Manager – CRACKER Coordinator
DFKI, Germany
georg.rehm@dfki.de
LT Innovate Summit – LR Dialogue Workshop, Panel “Language Resource Supply”
Brussels, Belgium, June 25, 2015
META-NET and META
• 60 research centres in 34 countries (via four EU-funded projects: T4ME, CESAR, METANET4U, META-NORD)
  http://www.meta-net.eu
• Multilingual Europe Technology Alliance, 794 members in 68 countries
  http://www.meta-net.eu/members
META-SHARE
• Pan-European infrastructure, bringing together providers and consumers of language data, tools and services.
• LRs are documented, uploaded, stored, catalogued, downloaded, shared – to improve visibility, documentation, identification, availability, interoperability.
• Caters for datasets, tools, services for LT research and development (both academic and commercial); META-SHARE includes repository software, a metadata model, licensing kit, statistics.
• 29 distributed repositories maintained by 37 organisations in 25 countries.
• 2,500+ resources (corpora: 49%, lexical: 38%, tools/services: 12%), covering ca. 100 languages.
• 7,000+ downloads in total; ca. 70% of all LRs have been downloaded.
MT
excellent: (none)
good: English
moderate: French, Spanish
fragmentary: Catalan, Dutch, German, Hungarian, Italian, Polish, Romanian
weak or no support: Basque, Bulgarian, Croatian, Czech, Danish, Estonian, Finnish, Galician, Greek, Icelandic, Irish, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Serbian, Slovak, Slovene, Swedish, Welsh

Resources
excellent: (none)
good: English
moderate: Czech, Dutch, French, German, Hungarian, Italian, Polish, Spanish, Swedish
fragmentary: Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
weak/no support: Icelandic, Irish, Latvian, Lithuanian, Maltese, Welsh
[Figure: level of support for resources (Weak/none, Fragmentary, Moderate, Good, Excellent) per language, from Welsh to English]
Languages with names in red have little or no MT support
Language White Paper Series
Europe’s Languages in the Digital Age (2011/2012)
Summary: “At Least 21 European Languages in Danger of Digital Extinction!”
http://www.cracker-project.eu • http://www.meta-net.eu
LR-Related Activities
[Timeline 2015–2017 (M1, M12, M24, M36): kick-off meeting for all ICT-17 projects; translate5; WMT 2016 and WMT 2017; IWSLT 2015, 2016 and 2017; QT Marathon 2015 and 2016; Roadmap for European MT Research; Survey on the State of HQMT in Industry and LSPs; SRIA (initial version, update, final); milestones labelled version 1 and version 2]
• Production of resources (e.g., for WMT 2016 and 2017, IWSLT 2015-2017)
• Tools for resources (quality control, evaluations; towards the idea of a smart workbench for translators)
• Strategies and roadmaps for resources (SRIA, Roadmap for European MT Research)
• Exchange and sharing facility for resources (META-SHARE)
Maintenance of Operations and Outreach
• Provide services, adapt them to evolving user requirements and licensing landscape:
  - adapt, streamline and extend the metadata schema;
  - adapt licensing toolkit to new international licensing setups;
  - streamline and simplify operations for repository providers and data depositors.
• Technical support and bug fixing
• Federation of projects – core seed: the group of H2020-ICT17 projects.
• Multi-lateral Memorandum of Understanding, ca. 20 projects in total (including FP7 and H2020-ICT15), to be approached in two phases (first phase almost completed).
• Selected areas of collaboration: data management and repositories (including Data Management Plan), tools and technologies; shared tasks and evaluations.
• http://www.cracking-the-language-barrier.eu will be launched soon.
MT Use Cases and Language Resources
• “Usability” is an unusual generic dimension for the evaluation of a resource.
• Reason: the majority of LRs can be used in many different research or application scenarios.
• More relevant dimensions: quality, availability, coverage, maturity, sustainability, adaptability, size, format, license, language, style etc. – depending on the use case.
• When talking about LRs for MT, it’s important to be specific in terms of the respective use case.
• Reason: the use case puts specific requirements on the type of LR and relevant dimensions.
Scenario: Inbound Translation (written texts)
  MT Use Case: Gist translation, provide an idea of a text’s contents
  Maturity of Technology: Deployed (Google Translate), research ongoing
  Human Involvement: –
  Relevance of Quality: Quality of MT secondary
  Methods: Statistical MT
  LR Requirements: Very large aligned data sets (the more data, the better)

Scenario: Outbound Translation (written texts)
  MT Use Case: Production quality, for publication
  Maturity of Technology: Research on HQMT has started, no POCs yet
  Human Involvement: –
  Relevance of Quality: Quality of MT extremely important, ideally HQ
  Methods: New approach needed, SMT, RBMT, hybrid systems (needs quality estimation methods)
  LR Requirements: Deeply annotated data sets with quality information (also needs more research)

Scenario: Outbound Translation (written texts)
  MT Use Case: Production quality, for publication
  Maturity of Technology: Deployed, usable via LSPs
  Human Involvement: Post-editing
  Relevance of Quality: Quality of initial MT step important but secondary
  Methods: MT, followed by post-editing, ideally with smart translation workbenches (CAT)
  LR Requirements: Translation memories and term databases (large coverage, high quality etc.)

Scenario: Speech to Speech Translation
  MT Use Case: Enable face-to-face conversations
  Maturity of Technology: Research ongoing but POCs exist (Skype)
  Human Involvement: –
  Relevance of Quality: Quality of MT secondary
  Methods: Recognition and generation of spoken language; statistical MT etc.
  LR Requirements: Several additional technologies and LR types needed (such as very large speech databases)
META-NET SRA LR Roadmap
• Infrastructure – maintain and extend sharing facility; promote documentation through metadata; intensify cooperation
• Coverage, Quality, Adequacy – increase number of LRs for all European languages to address application needs; promote evaluation and validation to improve LR quality constantly
• Acquisition – define best practices for LR production; automate production; distributed production (crowd-sourcing, social media, gamification etc.); bridge acquisition methods with LOD, big data
• Openness – elaborate simple and harmonised licensing solutions; promote openness and sharing of LRs
• Interoperability – promote and encourage use of standards
FLaReNet is a project funded under the eContentplus programme, grant agreement ECP-2007-LANG-617001. eContentplus is a multiannual Community programme to make digital content in Europe more accessible, usable and exploitable.
The Strategic Language Resource Agenda
Nicoletta Calzolari, Valeria Quochi, Claudia Soria
CNR - Istituto di Linguistica Computazionale “A. Zampolli”, Italy
with the contribution of
Núria Bel, University Pompeu Fabra, Spain
Gerhard Budin, Universität Wien, Austria
Khalid Choukri, ELDA, France
Joseph Mariani, LIMSI/IMMI-CNRS, France
Monica Monachini, CNR-ILC, Italy
Jan Odijk, Universiteit Utrecht, Netherlands
Stelios Piperidis, ILSP/“Athena” R.C., Greece
We need an LT Masterplan
• In 2015, LT is simply everywhere: search, interactive assistants (phones, cars, appliances), big data, social media analytics, etc. The potential is huge!
• Europe needs to follow a Language Technology Masterplan. Resources are only one piece of the puzzle; the masterplan also needs to reflect technologies, tools, research, innovation, platforms, infrastructures, services, language policy making, the language communities, flagship initiatives (CEF, DSM), etc.
• Europe is only starting to recognise the potential of LT.
• LT will be a key ingredient of our future IT – with or without Europe.
• Europe has a unique opportunity for a strategic investment into our future growth.
DECLARATION OF COMMON INTERESTS
We, the undersigned, declare here, at the Riga Summit on the Multilingual Digital Single Market, encouraged by the letter Vice President Andrus Ansip sent to its participants, that we stand united in our goal and interest to:
- support multilingualism in Europe by employing language technology in business, society and governance, to create a truly Multilingual Digital Single Market,
- exchange and share information in our efforts to promote our goals and interests at local, national and European levels,
- raise awareness in society at large using channels available to our associations, alliances and societies.
In the near future, we foresee the establishment of a Memorandum of Understanding among our organisations towards a “Coalition for a Multilingual Europe”, to better serve our members address the language barrier challenges towards establishing a truly integrated Multilingual Digital Single Market.
Riga, 29. April 2015
Signed by (in alphabetical order):
BDVA Laure Le Bars
CITIA Steve Renals
CLARIN Steven Krauwer
EFNIL Sabine Kirchmeier-Andersen, Tamás Váradi
ELEN Davyth Hicks, Claudia Soria
ELRA Nicoletta Calzolari, Khalid Choukri
GALA Laura Brandon, Robert E. Etches, Sergey Gladkov
LT Innovate Jochen Hummel, Philippe Wacker
META-NET Jan Hajic, Josef van Genabith, Georg Rehm, Andrejs Vasiljevs
NPLD Meirion Prys Jones
TAUS Jaap van der Meer
W3C Richard Ishida, Felix Sasaki
For any questions, please contact Georg.Rehm@dfki.de.
Strategic Agenda for the Multilingual Digital Single Market
Technologies for Overcoming Language Barriers towards a truly integrated European Online Market
DRAFT
Version 0.5 – April 22, 2015
The key ingredients are in place: the communities are ready and several strategic research agendas have been prepared, e.g.:
META-NET SRA, MDSM SRIA, Riga Summit Declaration
Translingual Cloud

Enable multilingual communication through web scale platform (also: Multilingual Digital Single Market)
Software engineering project; “one size fits all” approach; low risk of failure; increased security and data protection
Web service (including APIs) that makes use of SMT methods and large data sets

Web service platform for LT/MT research and innovation (hybrid research, continuous development and operations)
Enable the testing of new methods and avant-garde approaches with very large amounts of users
European research and innovation platform for novel LT/MT ideas and specialised services (e.g., genres, styles, registers etc.)

Web service platform for human translators and LSPs
Enable hand-in-hand operations of MT and human translation; enable high-quality human translation
Establish a sustainable technological link between human and machine (e.g., via human-generated and human-annotated data sets)