
Post on 18-Jul-2015


An approach to open source NLP tools for Galician as a minoritized variety of Portuguese in Spain

José Ramom Pichel Campos
R&D Director
imaxin|software

www.imaxin.com

1. Imaxin|software
2. Global Languages/Minority/Endangered/Minoritized Languages
3. What are the most important challenges that Minority/Endangered/Minoritized Languages pose for developing Natural Language Processing tools?
4. Galician-Portuguese as an example of a Minority/Minoritized language for developing open source/proprietary NLP tools

Minority, Endangered and Minoritized languages are not global languages, so...

What should we take into account when developing Natural Language Processing tools for Minority, Endangered and Minoritized languages?

Galician as an example

Generally, when computer scientists approach languages and computers... they are thinking of... computers

...But languages are spoken by people...

...in societies...

So, before developing NLP tools for Endangered Languages, you should think about language in a society

Sociolinguistics is the descriptive study of the effect of any and all aspects of society, including cultural norms, expectations, and context, on the way language is used, and the effects of language use on society.

Basic sociolinguistic issues to consider in order to develop better NLP tools for languages

What about Global Languages?
What about Minority Languages?
What about Endangered Languages?
What about Minoritized Languages?
And finally, when we approach a language, is it a different language or a variety of another language?

Global Languages (by Wikipedia)

A world language is a language that is spoken internationally and learned by many people as a second language.

A world language is characterized not only by its number of speakers (native or second language speakers), but also by its geographical distribution and its use in international organizations and diplomatic relations.

Global Languages

"A language is a dialect with an army and navy"
Sociolinguist and Yiddish scholar Max Weinreich

What about the next Global Languages?

Is Mandarin Chinese an easy language to learn?

é o chinês mandarim uma linguagem fácil de aprender?

是中国普通话的容易学的语言?

The historical reason for this is the period of expansionist European imperialism and colonialism (and the most powerful economies and armies in the world)

(English, French, Spanish, Portuguese, Dutch, etc.)

"A language is a dialect with an army and navy"
Sociolinguist and Yiddish scholar Max Weinreich

Language = dialect + army + navy

Dialect = Language – (army + navy)

Dialect: Minority > Minoritized > Endangered

Minority Languages (by Wikipedia)

A minority language is a language spoken by a minority of the population of a territory. Such people are termed linguistic minorities or language minorities.

Minority Languages

Endangered Languages (by Wikipedia)

An endangered language is a language that is at risk of falling out of use as its speakers die out or shift to speaking another language. Language loss occurs when the language has no more native speakers, and becomes a "dead language". If eventually no one speaks the language at all, it becomes an "extinct language".

Endangered Languages (by Wikipedia)

...While languages have always become extinct throughout human history, they are currently disappearing at an accelerated rate due to the processes of globalization and neo-colonialism, where the economically powerful languages dominate other languages.

http://www.voanews.com/content/rosetta-project-preserves-key-to-endangered-languages/1713317.html

Minoritized language

A minoritized language is a sociolinguistic term for a language that has suffered marginalization, persecution or even banning at some point in its history. The concept therefore highlights an act of enforcement that led to a reduction in use.

Minority language and minoritized language are not synonymous

Minoritized language

Different languages or different varieties of the same language?

As we know, "A language is a dialect with an army and navy" is a quip about the arbitrariness of the distinction between a dialect and a language. It points out the influence that social and political conditions can have over a community's perception of the status of a language or dialect.

The adage was popularized by the sociolinguist and Yiddish scholar Max Weinreich, who heard it from a member of the audience at one of his lectures.

Different languages or different varieties of the same language?

Natural Language Processing tools for any kind of language
Spell-checkers, grammar checkers, machine translation, lemmatizers, morphological analyzers, POS taggers, etc.

What are the most important challenges that Minority/Endangered/Minoritized Languages pose for developing Natural Language Processing tools?

1. Is there a stable written standard?

2. Is there a prescriptive authority of written standard language?

3. Who is our target? Children? Old people?

4. Kind of language (minoritized, minority, and distance from other languages)

5. Is there a recognized grammar?

6. Are there enough monolingual and bilingual corpora?

1. Is there a stable written standard?

A written language is the representation of a language by means of a writing system. Written language is an invention in that it must be taught to children; children will pick up spoken language (oral or sign) by exposure without being specifically taught.

A standard language (also standard dialect or standardized dialect) is a language variety used by a group of people in their public discourse.

1. Is there a stable written standard?

Nynorsk and Bokmål
Nynorsk was developed by the linguist Ivar Aasen in the 1850s, based on rural, spoken Norwegian rather than the cultured, Danish-influenced Norwegian spoken in cities. It was first officially codified in 1901, was given the name Nynorsk in 1929, and has been used officially (alongside Bokmål) since 1938.

1. Is there a stable written standard?

Nationalist and Reintegrationist Galician
The nationalist position considers Galician and Portuguese to be two distinct languages, although the two are closely related. Nationalists favour distinct writing and spelling rules for Galician and Portuguese. Accordingly, Galician spelling follows the model of Spanish orthography. This view is held by the majority of public and government organizations. Its standard norm, the "NOMIGa", is elaborated by the Real Academia Galega (Royal Galician Academy) and the Instituto da Língua Galega (Institute for Galician Language).

2. Is there an official authority of written standard language?

3. Who is our target? Children? Old people?

3. Children
Technological skills
Educational skills in the standard language
Very often, they don't speak their grandparents' language on a daily basis

3. Old people
No technological skills
No educational skills in the standard language
They don't speak the standard language
They are ashamed of their own language (it is useless, they think)

4. Kind of language (minoritized, minority, endangered, and distance from other languages)

MT = minoritized, MN = minority (1 = yes, 0 = no)

MT  MN  Example
0   1   Luxembourgish
1   0   Galician-Portuguese, Catalan-Valencian
1   0   Catalan
1   1   Galician-Spanish, Valencian, Aragonese, Friulan, Asturian
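
The table can be read as two independent yes/no flags per language variety. The following is a minimal Python sketch of that reading, assuming MT stands for minoritized and MN for minority (as the slide heading suggests). The class and function names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageStatus:
    """One entry of the table; field names are illustrative, not from the slides."""
    name: str
    minoritized: bool  # MT column (assumed meaning)
    minority: bool     # MN column (assumed meaning)


# Rows transcribed from the table, one entry per listed variety.
ROWS = [
    LanguageStatus("Luxembourgish", minoritized=False, minority=True),
    LanguageStatus("Galician-Portuguese", minoritized=True, minority=False),
    LanguageStatus("Catalan-Valencian", minoritized=True, minority=False),
    LanguageStatus("Catalan", minoritized=True, minority=False),
    LanguageStatus("Galician-Spanish", minoritized=True, minority=True),
    LanguageStatus("Valencian", minoritized=True, minority=True),
    LanguageStatus("Aragonese", minoritized=True, minority=True),
    LanguageStatus("Friulan", minoritized=True, minority=True),
    LanguageStatus("Asturian", minoritized=True, minority=True),
]


def describe(status: LanguageStatus) -> str:
    """Return a readable label such as 'minoritized + minority'."""
    labels = []
    if status.minoritized:
        labels.append("minoritized")
    if status.minority:
        labels.append("minority")
    return " + ".join(labels) if labels else "neither"


def group_by_status(rows):
    """Group variety names by their combined label, preserving table order."""
    groups = {}
    for row in rows:
        groups.setdefault(describe(row), []).append(row.name)
    return groups
```

Calling `group_by_status(ROWS)` groups the varieties under the labels "minority", "minoritized" and "minoritized + minority", which mirrors the three distinct flag combinations that appear in the table.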

5. Is there a recognized grammar?

6. Is there a good enough monolingual and bilingual corpus? Is it open source?

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

http://www.statmt.org/europarl/

6. Is there a good enough monolingual and bilingual corpus? Is it open source?

Tesouro galego-português
http://ilg.usc.es/Tesouro/pt/

GALIZA, as an example for learning how to develop NLP tools for minority/endangered/minoritized languages

Political Map versus Languages Map

Our language: Galician (Globally known as Portuguese)

https://www.youtube.com/watch?v=RPRxAcckmUA

Galiza is a sociolinguistics lab for developing NLP tools

1. Is there a stable written standard?

2. Is there a prescriptive authority of written standard language?

3. Who is our target? Children? Old people?

4. Kind of language (minoritized, minority, and distance from other languages)

5. Is there a recognized grammar?

6. Are there enough monolingual and bilingual corpora?

Galiza, as a sociolinguistics lab, implies another point of view on how to develop natural language processing tools

Galician-Spanish: Minoritized + Minority Language

“Vou facer o camiño de Santiago. Ao chegar a Galicia podes ver polas montañas moitos carballos.”

Galician-Portuguese: Minoritized Language

“Vou fazer o caminho de Santiago. Ao chegar à Galiza podes ver polas montanhas muitos carvalhos.”
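
The two spellings of the same sentence can be compared token by token to make the orthographic distance concrete. This is a minimal Python sketch, assuming simple whitespace tokenization and equal token counts (which holds for this sentence pair, not in general). The constant and function names are hypothetical.

```python
# The two example sentences from the slide, one per orthography.
GALICIAN_SPANISH_SPELLING = (
    "Vou facer o camiño de Santiago. "
    "Ao chegar a Galicia podes ver polas montañas moitos carballos."
)
GALICIAN_PORTUGUESE_SPELLING = (
    "Vou fazer o caminho de Santiago. "
    "Ao chegar à Galiza podes ver polas montanhas muitos carvalhos."
)


def tokenize(sentence):
    """Split on whitespace and drop trailing full stops and commas."""
    return [token.strip(".,") for token in sentence.split()]


def differing_tokens(first, second):
    """Return the (first, second) token pairs that are not spelled identically."""
    first_tokens = tokenize(first)
    second_tokens = tokenize(second)
    if len(first_tokens) != len(second_tokens):
        # This naive position-by-position comparison needs aligned sentences.
        raise ValueError("sentences must have the same number of tokens")
    return [(a, b) for a, b in zip(first_tokens, second_tokens) if a != b]


PAIRS = differing_tokens(GALICIAN_SPANISH_SPELLING, GALICIAN_PORTUGUESE_SPELLING)
```

For this pair, 7 of the 16 tokens differ, and most of the differences are spelling conventions such as ñ/nh (camiño/caminho) and ll/lh (carballos/carvalhos). A real comparison across a corpus would need proper tokenization and alignment, which this sketch does not attempt.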

Galician-Spanish: Minoritized + Minority Language
http://www.xunta.es/linguagalega/ferramentas_informaticas

Features
You have to develop software from scratch (high investment).
For open source, Galician is highly dependent on volunteers and public investment.
Huge diversity in terminology.
Proprietary software depends on the strategy of big companies (Microsoft, Apple, Sun, etc.).
Interference from Spanish.

Spell-Checkers

Minoritized errors + Spell-checker errors

http://www.xunta.es/linguagalega/galgo

Monolingual and bilingual corpora

http://sli.uvigo.es/RILG/
http://webs.uvigo.es/sli/recursos_en.html

Monolingual dictionaries
http://www.realacademiagalega.org/dicionario

Games and NLP
http://portaldaspalabras.org/

GalNET: Ontology of Galician-Spanish
http://sli.uvigo.es/galnet/galnet_var.php?ili=ili-30-12090890-n

Galician-Portuguese: Minoritized Language
http://gramatica.usc.es/~gamallo/
http://www.estraviz.org/

Features
You only have to customize state-of-the-art Portuguese software, where necessary.
For open source, Galician is less dependent on volunteers and public investment.
Less diversity in terminology (based on Portuguese and Brazilian choices).
Big companies (Microsoft, Apple, Sun, etc.) are more interested in localizing the Galician variety.
Interference from Spanish.
More open source, because you are reusing open source from Portugal and Brazil.

Open source Spell-checker
http://extensions.libreoffice.org/extension-center/corrector-ortografico-para-galego

Open Source Non-sexist Grammar Checker
http://www.exeria.net/que.php

Open Source Grammar Checker
http://wiki.mancomun.org/index.php/Golfi%C3%B1o._Corrector_gramatical_para_OpenOffice.org

Natural language processing tools (Galician-Portuguese)
Open source

http://www-nlp.stanford.edu/links/statnlp.html
http://gramatica.usc.es/~gamallo/

FLIP Portuguese spell-checker and Galician

Machine Translation

Open Source Machine Translation (Apertium and Matxin)

http://www.opentrad.com

Google Translate
https://translate.google.com/

Machine Translation

Opentrad (Apertium): RBMT

Moses (SMT)

ILP Paz Andrade

Trip to Endangered Languages

This is a blog about a trip through Endangered Languages in Europe. I'm convinced that Human Language Technologies can save them from disappearance.

http://tripendangeredlanguages.wordpress.com/

Thanks a lot! Obrigado!
@[email protected]

https://www.youtube.com/watch?v=l3zzn4k9dRE

Salgueiriños de Abaixo nº 11 L6
15703 Santiago de Compostela (A Coruña)
Tel. +34 981 554 068
[email protected]
Facebook: www.facebook.com/imaxinsoftware
Twitter: @imaxinsoftware
