The NLP pipeline in Machine Translation Mārcis Pinnis

Upload: marcis-pinnis

Post on 19-Mar-2017


Page 1: NLP pipeline in machine translation

The NLP pipeline in Machine Translation

Mārcis Pinnis

Page 2: NLP pipeline in machine translation

Overview
• Short introduction
• The NLP Pipeline in Machine Translation
• Selected tasks that are relevant for others (not MT developers)
• Example of data pre-processing using publicly available tools
• NLP pipelines for English and Latvian

Page 3: NLP pipeline in machine translation

Who am I?
• My name is Mārcis Pinnis
• I am a researcher at Tilde
• I have worked on language technologies for over 10 years
• Currently, my research focus is (neural) machine translation (MT)
• You can find more about my research and what we do in Tilde on:

Tilde MT
Neural MT

The “term” cloud summarises the topics of my last five publications

Page 4: NLP pipeline in machine translation

What is machine translation?

And ... what are its main goals?

Page 5: NLP pipeline in machine translation

What is Machine Translation?

Machine Translation (MT) is «the use of computers to automate translation from one language to another» /Jurafsky & Martin, 2009/

Statistical MT

Neural MT

Bing

Google

Language technologies are important in human-computer interaction.

Valodas tehnoloģijas ir svarīga cilvēka-datora mijiedarbībā.

Valodu tehnoloģijas ir svarīgs cilvēka-datora mijiedarbībā.

Valodu tehnoloģijas ir svarīgas cilvēka un datora mijiedarbībai.

Valodas tehnoloģijas ir svarīgas human-computer interaction.

Page 6: NLP pipeline in machine translation

What are the main goals of machine translation?

• To lower language barriers in communication
• To provide access to information written in an unknown language
• To increase productivity, e.g., of professional translators

Page 7: NLP pipeline in machine translation

The NLP Pipeline
From the very basic to the more challenging tasks

Page 8: NLP pipeline in machine translation

What to Start with?
• Imagine that you have to translate the following text
• What will you do first?

The European Union (EU) is a political and economic union of 28 member states that are located primarily in Europe. It has an area of 4,475,757 km2 (1,728,099 sq mi), and an estimated population of over 510 million.

Page 9: NLP pipeline in machine translation

Sentence Breaking
• First, we will split the text into sentences.
• Most basic NLP tools work with individual sentences; therefore, this is a mandatory step.

Input:
The European Union (EU) is a political and economic union of 28 member states that are located primarily in Europe. It has an area of 4,475,757 km2 (1,728,099 sq mi), and an estimated population of over 510 million.

Output:
The European Union (EU) is a political and economic union of 28 member states that are located primarily in Europe.
It has an area of 4,475,757 km2 (1,728,099 sq mi), and an estimated population of over 510 million.
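A minimal sketch of sentence breaking on the example text, assuming a single naive rule (a sentence ends at ".", "!" or "?" followed by whitespace and an uppercase letter). The function name `split_sentences` is hypothetical, and real sentence breakers handle abbreviations and other cases that this rule gets wrong.

```python
import re

# Naive boundary rule (an assumption for illustration): end-of-sentence
# punctuation, then whitespace, then an uppercase ASCII letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences using the naive boundary rule above."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


text = (
    "The European Union (EU) is a political and economic union of 28 member "
    "states that are located primarily in Europe. It has an area of "
    "4,475,757 km2 (1,728,099 sq mi), and an estimated population of over "
    "510 million."
)

for sentence in split_sentences(text):
    print(sentence)
```

The rule splits the example correctly, but it would also split after an abbreviation such as "Dr." when a capitalised word follows, which is why production tools rely on abbreviation lists or trained models.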

Page 10: NLP pipeline in machine translation

Tokenisation
• Then, we will split the text into primitive textual units (tokens).

Input:
The European Union (EU) is a political and economic union of 28 member states that are located primarily in Europe.
It has an area of 4,475,757 km2 (1,728,099 sq mi), and an estimated population of over 510 million.

Output:
The European Union ( EU ) is a political and economic union of 28 member states that are located primarily in Europe .
It has an area of 4,475,757 km 2 ( 1,728,099 sq mi ) , and an estimated population of over 510 million .

Token types: WORD, PUNCTUATION, NUMERAL, ID, SYMBOL, URL, XML, DATETIME, EMAIL, SMILEY, HASHTAG, CASHTAG, MENTION, RETWEET, OTHER
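The tokenisation shown on this slide can be approximated with a single regular expression. This is a sketch under stated assumptions, not the tokeniser used for the slides: the names `tokenize` and `token_type` are hypothetical, and only three of the token types listed above (WORD, NUMERAL, PUNCTUATION) are distinguished.

```python
import re

# Order matters: numbers with thousands separators first, then runs of
# letters, then any single non-space symbol.
_TOKEN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?|[^\W\d_]+|[^\w\s]|_")


def tokenize(sentence):
    """Return the list of tokens found in the sentence."""
    return _TOKEN.findall(sentence)


def token_type(token):
    """Assign a coarse type; everything that is not a word or number is PUNCTUATION."""
    if token[0].isdigit():
        return "NUMERAL"
    if token[0].isalpha():
        return "WORD"
    return "PUNCTUATION"


sentence = "It has an area of 4,475,757 km2 (1,728,099 sq mi)."
for token in tokenize(sentence):
    print(token, token_type(token))
```

Letters and digits are separated on purpose, so "km2" becomes "km" and "2" as on the slide, while "4,475,757" stays a single numeral.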

Page 11: NLP pipeline in machine translation

Now that the text is broken into tokens, let us look at some examples...

control system
dry clothes
moving man

The phrases are ambiguous!

Can you guess the translations of the following phrases?

Page 12: NLP pipeline in machine translation

Possible translations may be...

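control system: kontroles sistēma ("a control system", noun) or pārbaudi sistēmu ("check the system", verb)
dry clothes: sausas drēbes ("clothes that are dry", adjective) or žāvē drēbes ("dry the clothes", verb)
moving man: aizkustinošs vīrs ("a touching man") or kustīgs cilvēks ("an agile person")

Note: the phrases are way more ambiguous than the two examples given!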

Page 13: NLP pipeline in machine translation

The context is important!

To control system parameters, open Settings

Dry clothes can be taken out of the drier

He was a very moving man

In some cases, morphological disambiguation helps us to choose better translations.

Page 14: NLP pipeline in machine translation

Morphological Analysis
• Gives insight into the morphological ambiguity of words
• Lists «all» possible (morphological) analyses of a word
• Is often limited to a vocabulary

control: Noun, singular; Verb, first person, simple present; Verb, second person, simple present; ...
dry: Adjective, positive; Verb, first person, simple present; Verb, second person, simple present; ...
moving: Noun, singular; Adjective; Verb; ...
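A vocabulary-limited morphological analyser can be sketched as a plain dictionary lookup. The entries below are copied from the slide and are deliberately incomplete; the names `ANALYSES` and `analyse` are hypothetical.

```python
# Hand-written toy vocabulary (from the slide); a real analyser covers far
# more words and analyses.
ANALYSES = {
    "control": [
        "Noun, singular",
        "Verb, first person, simple present",
        "Verb, second person, simple present",
    ],
    "dry": [
        "Adjective, positive",
        "Verb, first person, simple present",
        "Verb, second person, simple present",
    ],
    "moving": ["Noun, singular", "Adjective", "Verb"],
}


def analyse(word):
    """Return every known analysis of the word, or an empty list if it is out of vocabulary."""
    return list(ANALYSES.get(word.lower(), []))


for word in ["control", "dry", "moving", "tablet"]:
    print(word, analyse(word))
```

The analyser does not look at context, so it returns all candidates; words outside the vocabulary (here "tablet") get no analysis at all.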

Page 15: NLP pipeline in machine translation

Morphological Tagging
• Performs morphological disambiguation of words using the context they are found in
• We have resolved the morphological ambiguity of «dry»
• When selecting translation equivalents for the word, we will be able to take the disambiguated data into account
• E.g., «dry» = «sauss» and «dry» != «žāvēt»

Dry/JJ clothes/NNS can/MD be/VB taken/VBN out/RP of/IN the/DT drier/NN

JJ = adjective
NNS = noun, plural
MD = modal verb
VB = verb
VBN = verb, past participle
RP = particle
IN = preposition
DT = determiner
NN = noun, singular
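To illustrate how context resolves the ambiguity of «dry», here is a toy tagger with one hand-written context rule. It is only a sketch: the lexicon, the rule and the name `tag` are assumptions made for this example, whereas real taggers are statistical or neural models trained on annotated corpora.

```python
# Toy lexicon: the set of possible tags for each word (hand-written).
LEXICON = {
    "dry": {"JJ", "VB"},
    "clothes": {"NNS"},
    "can": {"MD"},
    "be": {"VB"},
    "taken": {"VBN"},
    "out": {"RP"},
    "of": {"IN"},
    "the": {"DT"},
    "drier": {"NN"},
}


def tag(tokens):
    """Tag tokens; an ambiguous JJ/VB word is JJ only if a noun and then a verb follow."""
    tags = []
    for i, token in enumerate(tokens):
        candidates = LEXICON.get(token.lower(), {"NN"})  # unknown words: guess NN
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))
            continue
        following = [LEXICON.get(t.lower(), {"NN"}) for t in tokens[i + 1:i + 3]]
        noun_next = len(following) > 0 and bool(following[0] & {"NN", "NNS"})
        verb_after = len(following) > 1 and bool(following[1] & {"MD", "VB"})
        tags.append("JJ" if noun_next and verb_after else "VB")
    return tags


tokens = "Dry clothes can be taken out of the drier".split()
print(list(zip(tokens, tag(tokens))))
```

With the same lexicon, "Dry the clothes" is tagged VB DT NNS, because a determiner rather than a noun follows «dry».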

Page 16: NLP pipeline in machine translation

The context continues to be more important!
• Often it is not enough to perform morphological disambiguation
• We need to understand the words in a context
• We need to figure out which word modifies or depends on which other word in order to:
  • translate words in the correct order
  • translate words in the correct inflected forms

Page 17: NLP pipeline in machine translation

Syntactic Parsing
• We use a syntactic parser to tell us which words depend on which other words in a sentence and how phrases are structured
• We now know that «dry» is an adjectival modifier of «clothes»
• From this we can conclude that, e.g.:
  «dry» = «sausas» or «sausās»
  «dry» != «sausām» or «sauso»

The example has been parsed with the Stanford Parser. You can try it here: http://corenlp.run/

[Figure: constituency tree and dependency tree of the example sentence]
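The step from "«dry» is an adjectival modifier of «clothes»" to the choice of an inflected form can be sketched as follows. Everything here is hand-written for illustration and is not parser output: only the `amod` arc is stated on the slide, the other arcs and their Stanford-style labels are assumptions, and the feature table covers just two forms of the Latvian adjective «sauss».

```python
# (head, relation, dependent) arcs for "Dry clothes can be taken ...".
# Hand-written; only the amod arc is confirmed by the slide.
ARCS = [
    ("clothes", "amod", "Dry"),
    ("taken", "nsubjpass", "clothes"),
    ("taken", "aux", "can"),
    ("taken", "auxpass", "be"),
]

# Assumed features of the Latvian noun «drēbes» in this sentence.
NOUN_FEATURES = {"clothes": ("feminine", "plural", "nominative")}

# Two indefinite forms of «sauss», keyed by (gender, number, case).
SAUSS_FORMS = {
    ("feminine", "plural", "nominative"): "sausas",
    ("feminine", "plural", "dative"): "sausām",
}


def head_of(word):
    """Return (head, relation) for the word, or None if it has no head listed."""
    for head, relation, dependent in ARCS:
        if dependent == word:
            return head, relation
    return None


def inflect_dry(word):
    """Pick the form of «sauss» that agrees with the noun the adjective modifies."""
    found = head_of(word)
    if found is None or found[1] != "amod":
        raise ValueError(f"{word!r} is not an adjectival modifier")
    head, _ = found
    return SAUSS_FORMS[NOUN_FEATURES[head]]


print(head_of("Dry"))
print(inflect_dry("Dry"))
```

Because the adjective agrees with its head noun, knowing the dependency arc is what rules out forms such as «sausām».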

Page 18: NLP pipeline in machine translation

What is missing?
• We solved:
  • the morphological ambiguity
  • the syntactic ambiguity
• What next?
• How would you translate this sentence?

He took a tablet

Page 19: NLP pipeline in machine translation

Terms and named entities
• There is actually context missing to identify the intended meaning, right?
• Term recognition (TR) and named entity recognition (NER) tools allow us to perform semantic disambiguation

NER example: Stīvs Gulbis vinnēja loterijā
With NER (Stīvs Gulbis = PERSON): Stivs Gulbis won the lottery
Without NER: A stiff swan won the lottery

TR example: He took a tablet
With TR (tablet = tablete): Viņš iedzēra tableti ("He took a pill")
Without TR: Viņš paņēma planšeti ("He picked up a tablet computer")
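The effect of NER in the «Stīvs Gulbis» example can be imitated with a toy word-by-word translator that leaves recognised person names untouched. This is a sketch only: the dictionary, the list of known persons and the function names are assumptions, and real NER tools use trained models rather than a fixed name list.

```python
# Toy Latvian-English word dictionary (hand-written for this example).
WORD_DICT = {
    "stīvs": "stiff",
    "gulbis": "swan",
    "vinnēja": "won",
    "loterijā": "the lottery",
}

# Stand-in for a named entity recogniser: a fixed list of two-token person names.
KNOWN_PERSONS = {("Stīvs", "Gulbis")}


def find_person_spans(tokens):
    """Return the indices of tokens that belong to a known person name."""
    protected = set()
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) in KNOWN_PERSONS:
            protected.update({i, i + 1})
    return protected


def translate(tokens, use_ner):
    """Translate word by word, copying tokens inside recognised names unchanged."""
    protected = find_person_spans(tokens) if use_ner else set()
    output = []
    for i, token in enumerate(tokens):
        if i in protected:
            output.append(token)
        else:
            output.append(WORD_DICT.get(token.lower(), token))
    return " ".join(output)


tokens = "Stīvs Gulbis vinnēja loterijā".split()
print(translate(tokens, use_ner=True))
print(translate(tokens, use_ner=False))
```

Term recognition works along the same lines: a recognised term such as «tablet» is mapped to its glossary translation before the general dictionary or model is consulted.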

Page 20: NLP pipeline in machine translation

Semantic Analysis
• For some tasks (e.g., question answering), natural language understanding requires semantic parsing.
• E.g., shallow semantic parsing (a.k.a. semantic role labelling) allows us to analyse meaning by identifying predicates and their arguments in a sentence
• However, in MT, we tend not to go this deep in text analysis (one reason is that these tools are not widely available for many languages)

The example has been created with the Semantic Role Labeling Demo of University of Illinois at Urbana-Champaign. You can try it here: http://cogcomp.cs.illinois.edu/page/demo_view/srl
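The output of semantic role labelling is essentially a predicate with labelled arguments. The sketch below shows one way to represent such a frame; the annotation of "He took a tablet" is hand-written, the PropBank-style role names A0 and A1 are an assumption, and it is not the output of the demo mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A predicate together with its labelled arguments."""
    predicate: str
    arguments: dict = field(default_factory=dict)


# Hand-annotated frame: A0 = the one who takes, A1 = the thing taken.
frame = Frame(predicate="took", arguments={"A0": "He", "A1": "a tablet"})


def describe(frame):
    """Render the frame as predicate(role='text', ...)."""
    parts = [f"{role}={text!r}" for role, text in frame.arguments.items()]
    return f"{frame.predicate}({', '.join(parts)})"


print(describe(frame))
```

A question answering system could match such frames against a question ("Who took a tablet?") by comparing predicates and roles instead of surface word order.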

Page 21: NLP pipeline in machine translation

The Building Blocks of NLP
Pragmatics
Semantics
Syntax
Morphology

Page 22: NLP pipeline in machine translation

Were these all tasks of NLP?
• Obviously, not!
• Other tasks that were not discussed are, e.g.:
  • Anaphora resolution
  • Coreference analysis
  • Detokenisation
  • Language identification
  • Semantic role labelling (a.k.a. shallow semantic parsing)
  • Sentiment analysis
  • Stemming
  • Truecasing and recasing
  • Word segmentation (e.g., for Arabic, Japanese)
  • Word sense disambiguation
  • Word splitting (e.g., into sub-word units, or compound splitting)
  • Etc.

Does this address all issues of MT?
• Obviously, not!
• This just barely touches the surface!
• Other topics that have to be addressed:
  • Discourse-related phenomena: anaphora resolution, coreferences
  • Document translation (handling of formatting tags)
  • Domain adaptation
  • Interactive translation
  • Localisation (e.g., correct formatting of numbers, dates, punctuation, units of measurement, etc.)
  • Named entities
  • Online learning
  • Post-editing and computer-assisted translation
  • Quality estimation
  • Robustness to training data noise
  • Rule-based vs. statistical vs. neural vs. hybrid machine translation
  • Terminology
  • Truecasing and recasing
  • Etc.

Page 23: NLP pipeline in machine translation

Do we have time left for a short demo?
If not, try it yourself: https://github.com/pmarcis/nlp-example

Page 24: NLP pipeline in machine translation