Tokenization and Word Segmentation

Daniel Zeman, Rudolf Rosa
March 3, 2022
NPFL120 Multilingual Natural Language Processing
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Tokenization and Word Segmentation
• Important because training and test tokenization must match: if they differ, accuracy goes down
• Not always trivial
• May interact with morphology
• May include normalization (character-level)
Tokenization and Word Segmentation 1/28
Tokenization and Word Segmentation
• Issues of orthography of individual languages
• Issues caused by design decisions of individual corpora
• We will refer to the Universal Dependencies project (UD; https://universaldependencies.org/); more info in the following weeks
• Due to limited time, we will probably skip some slides at the end
Tokenization
“María, I love you!” Juan exclaimed.
«¡María, te amo!», exclamó Juan.

Untokenized:  «¡María,  te    amo!»,  exclamó  Juan.
              X         PRON  X       VERB     X

Tokenized:    «      ¡      María  ,      te    amo   !      »      ,
              PUNCT  PUNCT  PROPN  PUNCT  PRON  VERB  PUNCT  PUNCT  PUNCT
• Classic tokenization:
  • Separate punctuation from words
  • Recognize certain clusters of symbols like “...”
  • Perhaps keep together things like [email protected]
Using Unicode Character Categories
• https://perldoc.perl.org/perlunicode.html
$text =~ s/(\pP)/ $1 /g;
$text =~ s/^\s+//;
$text =~ s/\s+$//;

• The substitution surrounds every punctuation character with spaces; optionally recombine email addresses, URLs etc.
• Some problems:
  • haven ’ t (English; should be have n’t)
  • instal · lació (Catalan; should be 1 token)
  • single quote (punctuation) misspelled as acute accent (modifier letter)
  • writing systems without spaces
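The Perl one-liner above can be sketched in Python using the standard `unicodedata` module (the function name `tokenize` is ours, not the lecture's tool; it mirrors `s/(\pP)/ $1 /g` and therefore also reproduces the haven ’ t problem):

```python
import unicodedata

def tokenize(text: str) -> list[str]:
    # Mirror Perl's s/(\pP)/ $1 /g: surround every character in a Unicode
    # punctuation category (Pc, Pd, Pe, Pf, Pi, Po, Ps) with spaces.
    spaced = "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # split() also covers the trimming done by s/^\s+// and s/\s+$//.
    return spaced.split()

print(tokenize("«¡María, te amo!», exclamó Juan."))
# ['«', '¡', 'María', ',', 'te', 'amo', '!', '»', ',', 'exclamó', 'Juan', '.']
print(tokenize("haven't"))
# ['haven', "'", 't']  <- the English clitic problem from the slide
```

The second call shows why pure category-based splitting is not enough: the apostrophe is punctuation, so the clitic ends up wrongly separated.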
Normalization

• Often part of tokenization
• Decimal comma to decimal point; separator of thousands
• Unicode directed quotes and long hyphens to undirected ASCII
  • “English” — ‘English’ — „česky“ — ‚česky‘ — « français » — ‹ français › — „magyar” — »magyar« — ’magyar’
  • Sometimes mistaken for ACUTE ACCENT, PRIME (math) etc.
  • TEX-like ASCII directed quotes `` and '' and hyphens -- and ---
• English/ASCII punctuation in foreign writing systems
  • 「你看過《三國演義》嗎?」他問我。
  • “你看過‘三國演義’嗎?”他問我.
• European/ASCII digits in Arabic, Devanagari etc.
  • 0 1 2 3 4 5 6 7 8 9 (Western Arabic/European)
  • ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ (Eastern Arabic)
  • ० १ २ ३ ४ ५ ६ ७ ८ ९ (Devanagari)
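A minimal normalization sketch for the mappings listed above (the mapping tables are our own assumption of what such a normalizer might contain, not the lecture's actual rules):

```python
# Directed quotes and long dashes to undirected ASCII, and Eastern Arabic /
# Devanagari digits to European digits, via str.translate.
QUOTE_MAP = {ord(c): '"' for c in '“”„«»'}
QUOTE_MAP.update({ord(c): "'" for c in '‘’‚‹›'})
DASH_MAP = {ord(c): '-' for c in '–—'}
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}        # ٠…٩ Eastern Arabic
DIGIT_MAP.update({0x0966 + i: str(i) for i in range(10)})  # ०…९ Devanagari

NORM_TABLE = {**QUOTE_MAP, **DASH_MAP, **DIGIT_MAP}

def normalize(text: str) -> str:
    return text.translate(NORM_TABLE)

print(normalize('„česky“ — « français » — ٤٢ — ४२'))
# "česky" - " français " - 42 - 42
```

A real normalizer would need more care, e.g. distinguishing an apostrophe from a misused acute accent, which a pure character map cannot do.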
Word Segmentation
Let’s go to the sea.
Vámonos al mar.

Vámonos  al  mar   .
VERB?    X   NOUN  PUNCT

Vamos  nos   a    el   mar   .
VERB   PRON  ADP  DET  NOUN  PUNCT

• Syntactic word vs. orthographic word
• Multi-word tokens
• Two-level scheme:
  • Tokenization (low level, punctuation, concatenative)
  • Word segmentation (higher level, not necessarily concatenative)
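A toy splitter for the Spanish multi-word tokens in this example (the lookup tables cover only these few cases and are our own illustration, not a real morphological analyzer):

```python
# Contractions and clitic combinations mapped to their syntactic words.
CONTRACTIONS = {"al": ["a", "el"], "del": ["de", "el"]}
CLITICS = {"vámonos": ["vamos", "nos"]}

def split_words(tokens: list[str]) -> list[str]:
    words = []
    for tok in tokens:
        low = tok.lower()
        # Fall back to the token itself when no rule applies.
        words.extend(CONTRACTIONS.get(low) or CLITICS.get(low) or [tok])
    return words

print(split_words(["Vámonos", "al", "mar", "."]))
# ['vamos', 'nos', 'a', 'el', 'mar', '.']
```

Note that this is not purely concatenative: vámonos is not the plain concatenation of vamos and nos, which is exactly why the higher-level word segmentation step is needed.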
Word Segmentation
• Orthographic vs. syntactic word
  • Syntactically autonomous part of an orthographic word
  • Contractions (al = a + el)
  • Clitics (vámonos = vamos + nos)
• ¿A qué hora nos vamos mañana?“What time do we leave tomorrow?”
• Nos despertamos a las cinco.“We wake up at five.”
• Nuestro guía nos despierta a las cinco.“Our guide wakes us up at five.”
Contractions in Arabic
He abdicated in favour of his son Baudouin.
يتنازل عن العرش لابنه بودوان

yatanāzalu   ʿan  al-ʿarši    li+ibni+hi     būdūān
surrendered  on   the-throne  to+son+his     Baudouin
VERB         ADP  NOUN        ADP+NOUN+PRON  PROPN
Segmentation as Part of Morphological Analysis
• Arabic
  • ElixirFM: http://lindat.mff.cuni.cz/services/elixirfm/run.php
  • Select Resolve
  • Enter “لابنه” (labnh)
• Sanskrit
  • Sanskrit Reader Companion: https://sanskrit.inria.fr/DICO/reader.fr.html
  • Select Input convention = Devanagari
  • Enter “सकलाथरशा सार जगित समालोकय िवषणशमदम” (sakalārthaśāstrasāram jagati samālokya viṣṇuśarmedam)
• German compound splitting (unsupervised)
  • Not split in Universal Dependencies
Chinese Word Segmentation
We are now in Valencia.
現在我們在瓦倫西亞。
Xiàn zài wǒ men zài wǎ lún xī yǎ.

現在       我們     在    瓦倫西亞     。
Xiànzài   wǒmen   zài   Wǎlúnxīyǎ   .
now       we      in    Valencia    .
ADV       PRON    ADP   PROPN       PUNCT
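Dictionary-based forward maximum matching is a classic baseline for Chinese word segmentation; a sketch under the assumption of a tiny hand-made lexicon (the slide itself does not prescribe an algorithm):

```python
def max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"現在", "我們", "在", "瓦倫西亞"}
print(max_match("現在我們在瓦倫西亞。", lexicon, max_len=4))
# ['現在', '我們', '在', '瓦倫西亞', '。']
```

Greedy matching fails on genuinely ambiguous strings, which is why modern segmenters use statistical or neural models instead; the baseline is only meant to make the task concrete.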
Words in Japanese
I went to the beauty salon of Kyōdō [, Beyond-R.]
経堂    の    美容室         に    行っ   て     き     まし   た
Kyōdō  no    miyōshitsu    ni    it     te     ki     mashi  ta
経堂    の    美容室         に    行く   て     来る   ます    た
Kyōdō  of    beauty-salon  to    go     CONV   come   will   PAST
PROPN  ADP   NOUN          ADP   VERB   SCONJ  AUX    AUX    AUX

(dependency arcs: nmod, case, obl, case, aux, aux, aux, mark)
Words in Japanese
I went to the beauty salon of Kyōdō [, Beyond-R.]
経堂    の    美容室         に    行って          きました
Kyōdō  no    miyōshitsu    ni    itte           kimashita
経堂    の    美容室         に    行く            来る
Kyōdō  of    beauty-salon  to    going          come
PROPN  ADP   NOUN          ADP   VERB           VERB
                                 VerbForm=Conv  VerbForm=Fin|Tense=Past|Polite=Form

(dependency arcs: nmod, case, obl, case, advcl)
Words in Japanese
I went to the beauty salon of Kyōdō [, Beyond-R.]
経堂の       美容室に           行って          きました
Kyōdōno    miyōshitsuni     itte           kimashita
経堂        美容室            行く            来る
of-Kyōdō   to-beauty-salon  going          come
PROPN      NOUN             VERB           VERB
Case=Gen   Case=Dat         VerbForm=Conv  VerbForm=Fin|Tense=Past|Polite=Form

(dependency arcs: nmod, obl, advcl)
Vietnamese: Words with Spaces
All the concrete country roads are the result of…

Tất cả  đường  bêtông    nội đồng  là   thành quả    …
all     road   concrete  country   is   achievement  …
PRON    NOUN   NOUN      NOUN      AUX  NOUN         PUNCT

• Spaces delimit monosyllabic morphemes, not words.
• Multiple syllables without a space occur in loanwords (bêtông).
• Spaces are allowed to occur word-internally in Vietnamese UD.
Numbers with Spaces
# text = Il touche environ 100 000 sesterces par an.
1  Il         il        PRON   …  2  nsubj   _  _
2  touche     toucher   VERB   …  0  root    _  _
3  environ    environ   ADV    …  4  advmod  _  _
4  100 000    100 000   NUM    …  5  nummod  _  _
5  sesterces  sesterce  NOUN   …  2  obj     _  _
6  par        par       ADP    …  7  case    _  _
7  an         an        NOUN   …  2  obl     _  SpaceAfter=No
8  .          .         PUNCT  …  2  punct   _  _
Fixed Expressions
One syntactic word spans several orthographic words?

# text = Bin nach wie vor sehr zufrieden.
# text_en = I am still very satisfied.
1  Bin        sein       AUX    …  6  cop     _  _
2  nach       nach       ADP    …  6  obl     _  _
3  wie        wie        ADV    …  2  fixed   _  _
4  vor        vor        ADP    …  2  fixed   _  _
5  sehr       sehr       ADV    …  6  advmod  _  _
6  zufrieden  zufrieden  ADJ    …  0  root    _  SpaceAfter=No
7  .          .          PUNCT  …  6  punct   _  _
Fixed Expressions
One syntactic word spans several orthographic words?
I am still very satisfied.

Bin  nach   wie   vor     sehr  zufrieden  .
am   after  like  before  very  satisfied  .
AUX  ADP    ADV   ADP     ADV   ADJ        PUNCT

(dependency arcs: cop, obl, fixed, fixed, advmod, punct)
Multi-Word Expressions outside UD
Some corpora use the underscore character to glue MWEs together.
I am still very satisfied.
Bin  nach_wie_vor       sehr  zufrieden  .
am   after_like_before  very  satisfied  .
AUX  ADV                ADV   ADJ        PUNCT

(dependency arcs: cop, advmod, advmod, punct)
Multi-Word Expressions outside UD
Some corpora use the underscore character to glue MWEs together.
• Durante la presentación del libro ”La_prosperidad_por_medio_de_la_investigación_._La_investigación_básica_en_EEUU”, editado por la Comunidad_de_Madrid, el secretario general de la Confederación_Empresarial_de_Madrid-CEOE (CEIM), Alejandro_Couceiro, abogó por la formación de los investigadores en temas de innovación tecnológica.
  (“During the presentation of the book ‘Prosperity through research. Basic research in the USA’, edited by the Community of Madrid, the secretary general of the Madrid Business Confederation-CEOE (CEIM), Alejandro_Couceiro, advocated training researchers in topics of technological innovation.”)
• Lemmas?
• Tags?
Word Segmentation Summary
• When to split?
  • Only part of the token involved in a relation to something outside the token? Split!
  • Hard time finding a POS tag? Split!
  • Hard time finding a dependency relation? Don’t split!
    • Or not a hard time, but the relation would be compound, flat, fixed or goeswith.
  • Border case? Keep orthographic words (if they exist).
• Words with spaces
  • Vietnamese writing system
  • Very restricted set of exceptions (numbers)
  • Special relations elsewhere (fixed, compound)
Recoverability: CoNLL-U Format
# text = Vámonos al mar.
# text_en = Let’s go to the sea.
ID   FORM     LEMMA     UPOS   …  HEAD  DEPREL  DEPS  MISC
1-2  Vámonos  _         _      …  _     _       _     _
1    Vamos    ir        VERB   …  0     root    _     _
2    nos      nosotros  PRON   …  1     obj     _     _
3-4  al       _         _      …  _     _       _     _
3    a        a         ADP    …  5     case    _     _
4    el       el        DET    …  5     det     _     _
5    mar      mar       NOUN   …  1     obl     _     SpaceAfter=No
6    .        .         PUNCT  …  1     punct   _     _
Recoverability: CoNLL-U Format
# text = Vámonos al mar.
# text_en = Let’s go to the sea.
ID   FORM     LEMMA     UPOS   …  HEAD  DEPREL  DEPS  MISC
1-2  Vámonos  _         _      …  _     _       _     _
1    Vamos    ir        VERB   …  0     root    _     _
2    nos      nosotros  PRON   …  1     obj     _     _
3-4  al       _         _      …  _     _       _     _
3    a        a         ADP    …  5     case    _     _
4    el       el        DET    …  5     det     _     _
5-6  mar.     _         _      …  _     _       _     _
5    mar      mar       NOUN   …  1     obl     _     _
6    .        .         PUNCT  …  1     punct   _     _
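The two-level scheme is directly visible in the format: multi-word token lines carry range IDs like 1-2, while syntactic words get integer IDs. A minimal reader sketch (the helper name is ours; real CoNLL-U also allows decimal IDs for empty nodes, which this sketch ignores):

```python
def read_conllu_sentence(lines: list[str]):
    """Separate surface tokens (range IDs) from syntactic words."""
    tokens, words = [], []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.split("\t")
        if "-" in cols[0]:            # e.g. "1-2 Vámonos": surface token
            start, end = map(int, cols[0].split("-"))
            tokens.append((start, end, cols[1]))
        else:                         # e.g. "1 Vamos": syntactic word
            words.append((int(cols[0]), cols[1]))
    return tokens, words

sent = [
    "# text = Vámonos al mar.",
    "1-2\tVámonos\t_",
    "1\tVamos\tir",
    "2\tnos\tnosotros",
    "3-4\tal\t_",
    "3\ta\ta",
    "4\tel\tel",
    "5\tmar\tmar",
    "6\t.\t.",
]
tokens, words = read_conllu_sentence(sent)
print(tokens)  # [(1, 2, 'Vámonos'), (3, 4, 'al')]
print(words)   # [(1, 'Vamos'), (2, 'nos'), (3, 'a'), (4, 'el'), (5, 'mar'), (6, '.')]
```

The original orthography is recoverable: wherever a range covers some word IDs, print the token form instead of the covered words.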
Tokenization vs. Multi-word Tokens
• Parallelism among closely related languages
  • ca: informar-se sobre el patrimoni cultural
  • es: informarse sobre el patrimonio cultural
  • en: learn about cultural heritage
  • ca: L’únic que veig és => L’ únic que veig és
  • en: don’t => do n’t
• No strict guidelines for tokenization (yet)
  • UD English: non-stop, post-war: single-word tokens
  • UD Czech: non-stop would be split into three tokens
  • Abbreviations: etc.
  • End of sentence…
Tokenization vs. Multi-word Tokens Summary
• Punctuation involved? Low level!
  • Exceptions: Spanish-Catalan parallelism.
• Boundary between two letters? Typically high level.
  • Exceptions: Chinese, Japanese.
• Non-concatenative? High level!
Errors in Underlying Text
• We do not want to hide errors (learning robust parsers!)
  • But reference corpora (linguistic research) may want to hide them.
• Possibilities:
  • Typo not involving a word boundary:
    FORM = anotation; LEMMA = annotation; FEATS: Typo=Yes; MISC: Correct=annotation
  • Wrongly split word: ann otation
    (both parts tagged X, connected by the goeswith relation)
  • Wrongly merged words: thecar
    • Fix the tokenization (i.e. two lines); first line MISC: SpaceAfter=No | CorrectSpaceAfter=Yes
  • Sentence segmentation can be affected, too!
Errors in Underlying Text
• Wrong morphology: the cars is produced in Detroit
  • Not like a normal typo (the car iss produced…)
  • Not obvious what is correct:
    • the car is
    • the cars are
  • Suggestion: select which word to fix, e.g. cars to car
    • FORM = cars; FEATS: Number=Plur; MISC: Correct=car | CorrectNumber=Sing
• cs: viděl moři “he saw the sea”
  • Should be moře
  • Would be Case=Acc (disambiguated from Case=Acc,Gen,Nom,Voc)
  • This form is Case=Dat,Loc (but which one?)
    • cestoval k moři “he traveled to the sea” Case=Dat
    • plavil se po moři “he sailed the sea” Case=Loc
Tokenization Alignment
• If you need to match two different tokenizations
  • Use case: evaluation of end-to-end parsing systems
• Normalization involved? Bad luck…
  • Normalization rules needed
  • Or: longest common subsequence (LCS) algorithm
• Otherwise easy:
  • Non-whitespace character offsets
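The LCS fallback can be sketched with Python's standard `difflib`, whose `SequenceMatcher` recursively finds longest matching blocks over the non-whitespace characters (an approximation of a true LCS, and our own illustration rather than the lecture's tool):

```python
import difflib

# Gold tokenization vs. a system whose normalization split "im" into "in dem",
# so the character sequences no longer match exactly.
gold = "".join("Die Kosten sind definitiv auch im Rahmen .".split())
system = "".join("Die Kosten sind definitiv auch in dem Rahmen .".split())

matcher = difflib.SequenceMatcher(a=gold, b=system, autojunk=False)
blocks = [b for b in matcher.get_matching_blocks() if b.size]
for block in blocks:
    # Each block is a common substring of the two character sequences.
    print(gold[block.a:block.a + block.size])
```

Here the gold characters are a subsequence of the system's, so the matched blocks cover all of gold and every gold token can be assigned a corresponding span on the system side.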
Evaluation Metrics
• Align system-output tokens to gold tokens

Al-Zaman : American forces killed Shaikh Abdullah al-Ani, the preacher at the mosque in the town of Qaim, near the Syrian border.

GOLD:    Al   -  Zaman  :  American  forces  killed  Shaikh  …
OFFSET:  0-1  2  3-7    8  9-16      17-22   23-28   29-34

• All characters except whitespace match => easy to align!

SYSTEM:  Al-Zaman  :  American  forces  killed  Shaikh  …
OFFSET:  0-7       8  9-16      17-22   23-28   29-34
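The non-whitespace character offsets from this slide can be computed as follows (a sketch; the function name is ours):

```python
def char_offsets(tokens: list[str]) -> list[tuple[str, int, int]]:
    """Concatenate tokens while ignoring whitespace, recording each token's
    first and last character offset, as in the slide's OFFSET rows."""
    spans, pos = [], 0
    for tok in tokens:
        compact = tok.replace(" ", "")  # e.g. the number token "100 000"
        spans.append((tok, pos, pos + len(compact) - 1))
        pos += len(compact)
    return spans

print(char_offsets(["Al", "-", "Zaman", ":", "American", "forces"]))
# [('Al', 0, 1), ('-', 2, 2), ('Zaman', 3, 7), (':', 8, 8), ('American', 9, 16), ('forces', 17, 22)]
print(char_offsets(["Al-Zaman", ":", "American", "forces"]))
# [('Al-Zaman', 0, 7), (':', 8, 8), ('American', 9, 16), ('forces', 17, 22)]
```

Since both tokenizations consume the same non-whitespace characters, a gold token and a system token align exactly when their offset spans coincide.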
Evaluation Metrics
• Align system-output tokens to gold tokens

Die Kosten sind definitiv auch im Rahmen.

GOLD:    Die  Kosten  sind  definitiv  auch   im      Rahmen  .
SPLIT:   Die  Kosten  sind  definitiv  auch   in dem  Rahmen  .
OFFSET:  0-2  3-8     9-12  13-21      22-25  26-27   28-33   34

• Corresponding but not identical spans? Find the longest common subsequence.

SYSTEM:  Kosten  sind  definitiv   auch   im     Rahmen  .
SPLIT:   Kosten  sind  de finitiv  auch   im     Rahmen  .
OFFSET:  3-8     9-12  13-21       22-25  26-27  28-33   34
Evaluation Metrics
• Align system-output tokens to gold tokens

Die Kosten sind definitiv auch im Rahmen.

GOLD:    Die  Kosten  sind  definitiv  auch   im      Rahmen  .
SPLIT:   Die  Kosten  sind  definitiv  auch   in dem  Rahmen  .
OFFSET:  0-2  3-8     9-12  13-21      22-25  26-27   28-33   34

• Corresponding but not identical spans? Find the longest common subsequence.

SYSTEM:  auch   im     Rahmen  .
SPLIT:   auch   in einem , dem alle zustimmen , Rahmen .
OFFSET:  22-25  26-27  28-33   34