
Recap

§ Right document granularity is application-dependent

§ Tokenization splits documents into tokens, which are then further normalized to obtain terms as meaningful units that are kept in a dictionary and can be looked up

§ Stemming and lemmatization try to map different forms of the same word onto a canonical representation

§ Spelling mistakes by users are a problem and can be mitigated by trying to map to a known word

Information Retrieval / Chapter 2: Natural Language Preprocessing

2.6 Synonymy and Polysemy

§ Synonyms (i.e., different words having the same meaning) and polysemes (i.e., words having multiple meanings) are a challenge for IR systems

§ Lexical databases provide descriptions of the different meanings of a word and information about relationships between different words

§ These can then be used in an IR system, for instance, to automatically expand queries by adding synonyms or adding related words for disambiguation
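
The query expansion idea can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-written synonym lexicon; the dictionary `SYNONYMS` and the function `expand_query` are hypothetical names introduced here, not part of any lexical database or IR system.

```python
# Toy synonym lexicon; the entries are illustrative assumptions,
# not data taken from WordNet or any other lexical database.
SYNONYMS = {
    "car": ["automobile"],
    "holidays": ["vacation"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms followed by their known synonyms, without duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in synonyms.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


print(expand_query("cheap car holidays"))
# ['cheap', 'car', 'holidays', 'automobile', 'vacation']
```

A real system would typically weight the added terms lower than the original ones, since a synonym of the wrong word sense can hurt precision.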


Word Relations

§ Different relations can exist between two words, e.g.:

§ Synonymy (identical meaning) (e.g., car – automobile, holidays – vacation)

§ Antonymy (opposite meaning) (e.g., lucky – unlucky, expensive – cheap)

§ Hypernymy (more general) (e.g., mammal – rodent, machine – computer)

§ Hyponymy (more specific) (e.g., rat – rodent, rodent – mammal)

§ Meronymy (part-of relation) (e.g., tree – forest, board – computer)
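
Hypernymy and hyponymy are inverse relations, and hypernymy can be followed transitively. The sketch below stores the hypernym examples from the list above as simple links; it assumes every word has at most one direct hypernym, which real lexical databases do not guarantee. The names `HYPERNYM_OF`, `hypernyms`, and `hyponyms` are introduced here for illustration only.

```python
# Direct hypernym links (specific -> more general), taken from the examples above.
# Assumption: each word has at most one direct hypernym.
HYPERNYM_OF = {
    "rat": "rodent",
    "rodent": "mammal",
    "computer": "machine",
}


def hypernyms(word):
    """Follow hypernym links upwards and return all more general words."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Hyponymy is the inverse relation: words whose direct hypernym is `word`."""
    return sorted(w for w, h in HYPERNYM_OF.items() if h == word)


print(hypernyms("rat"))    # ['rodent', 'mammal']
print(hyponyms("rodent"))  # ['rat']
```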


WordNet

§ WordNet is a lexical database for English that has been manually curated for more than twenty years
https://wordnet.princeton.edu

§ WordNet groups words having the same meaning into a synset and provides a gloss as a short explanation of its meaning for each of them

§ In addition, it provides information about relations (e.g., antonymy, meronymy, hypernymy, hyponymy) between different synsets
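
The synset idea can be pictured as a small data structure. The following is a sketch under stated assumptions: the `Synset` class, the identifiers, and the glosses are written for this illustration and are not copied from WordNet or its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    synset_id: str  # hypothetical identifier
    lemmas: tuple   # words that share this meaning
    gloss: str      # short explanation of the meaning


# Illustrative entries written for this sketch; not copied from WordNet.
SYNSETS = [
    Synset("car.1", ("car", "automobile"), "a road vehicle with an engine"),
    Synset("car.2", ("car", "railcar"), "a wheeled vehicle that runs on rails"),
    Synset("holiday.1", ("holidays", "vacation"), "time away from work"),
]


def synsets_for(word):
    """A polysemous word appears in several synsets, one per meaning."""
    return [s for s in SYNSETS if word in s.lemmas]


def synonyms_of(word):
    """Synonyms are the other lemmas in the word's synsets."""
    return sorted({l for s in synsets_for(word) for l in s.lemmas if l != word})


for synset in synsets_for("car"):
    print(synset.synset_id, "-", synset.gloss)
print(synonyms_of("car"))  # ['automobile', 'railcar']
```

Note that collecting synonyms across all synsets mixes word senses; picking the right synset first is the disambiguation problem mentioned earlier.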


WordNet

§ Example: Synsets for the word car


WordNet

§ Example: Direct hyponyms for the word car


GermaNet

§ GermaNet is a lexical database for German, which is similar in terms of functionality to WordNet
http://www.sfs.uni-tuebingen.de/GermaNet/
http://weblicht.sfs.uni-tuebingen.de/germanet/

GermaNet

§ Example: Synsets and relations for the word Wagen


Wiktionary

§ Wiktionary is a Wikimedia Foundation project aiming to collaboratively construct lexical databases for different languages
https://www.wiktionary.org

Wiktionary

§ Example: Available information for the word Wagen


Universal WordNet (UWN)

§ Universal WordNet (UWN) is a multilingual lexical database connecting words and synsets across different languages
http://www.lexvo.org/uwn/

Universal WordNet (UWN)

§ Example: Available information for the word car


2.7 Parts of Speech

§ Part-of-speech tagging labels words in natural language texts with their part of speech (e.g., noun, verb, adverb)

§ Part-of-speech tags are essential for other tasks such as lemmatization and named entity recognition, and for downstream applications such as information extraction


POS Tags

§ The Penn Treebank project provides a commonly used list of part-of-speech tags, e.g.:

§ DT Determiner

§ JJ Adjective

§ JJR Adjective, comparative

§ NN Noun, singular or mass

§ NNS Noun, plural

§ NNP Proper noun, singular

§ NNPS Proper noun, plural

§ VB Verb, base form

§ Penn Treebank POS Tags:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
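
Downstream steps often only need a coarse word class rather than the full tag. The sketch below collapses the tags listed above by prefix and filters nouns from a tagged sentence. The tagged input is hand-labelled for illustration (an assumption, not the output of a real tagger), and `coarse_class` is a hypothetical helper.

```python
# Tag descriptions as listed above (a subset of the Penn Treebank tag set).
PENN_TAGS = {
    "DT": "Determiner",
    "JJ": "Adjective",
    "JJR": "Adjective, comparative",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "NNPS": "Proper noun, plural",
    "VB": "Verb, base form",
}


def coarse_class(tag):
    """Collapse a Penn Treebank tag into a coarse word class by its prefix."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("JJ"):
        return "adjective"
    if tag.startswith("VB"):
        return "verb"
    return "other"


# Hand-tagged toy input (an assumption, not the output of a real tagger).
tagged = [("the", "DT"), ("louder", "JJR"), ("drums", "NNS"), ("Nirvana", "NNP")]
nouns = [word for word, tag in tagged if coarse_class(tag) == "noun"]
print(nouns)  # ['drums', 'Nirvana']
```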

Parts of Speech

§ Example: POS tags determined by Stanford CoreNLP


David Eric Grohl (born January 14, 1969) is an American musician, singer, songwriter, record producer, multi-instrumentalist and film director. He is the founder, frontman, lead vocalist, rhythm guitarist, lead guitarist, and primary songwriter of the rock band Foo Fighters since 1994, and was the longest-serving drummer for the rock band Nirvana from 1990 to 1994.

Source: https://en.wikipedia.org/wiki/Dave_Grohl

2.8 Dependencies

§ Dependency parsing identifies so-called head words in natural language texts and identifies relationships between those and other words that modify them

§ The Universal Dependencies Framework provides a collection of consistent annotations across languages
http://universaldependencies.org

§ Dependencies are essential for other tasks such as co-reference resolution and information extraction
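
A dependency parse can be represented as a list of (head, relation, dependent) triples. The sketch below uses a hand-annotated parse of the sentence "The band won four awards" with Universal Dependencies relation names; it is an assumption made for illustration, not parser output, and it identifies words by their surface form, which only works when no word repeats in the sentence.

```python
from collections import defaultdict

# Hand-annotated parse of "The band won four awards" using Universal
# Dependencies relation names; an assumption, not parser output.
# Each triple is (head, relation, dependent).
DEPENDENCIES = [
    ("won", "nsubj", "band"),
    ("won", "obj", "awards"),
    ("band", "det", "The"),
    ("awards", "nummod", "four"),
]


def modifiers_by_head(dependencies):
    """Group dependents (with their relation) under their head word."""
    grouped = defaultdict(list)
    for head, relation, dependent in dependencies:
        grouped[head].append((relation, dependent))
    return dict(grouped)


def subject_verb_object(dependencies):
    """Extract (subject, verb, object) triples, a simple form of information extraction."""
    triples = []
    for head, modifiers in modifiers_by_head(dependencies).items():
        roles = dict(modifiers)
        if "nsubj" in roles and "obj" in roles:
            triples.append((roles["nsubj"], head, roles["obj"]))
    return triples


print(subject_verb_object(DEPENDENCIES))  # [('band', 'won', 'awards')]
```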


Dependencies

§ Example: Dependencies determined by Stanford CoreNLP


Foo Fighters have found worldwide success winning multiple awards, most notably with four of its albums winning Grammy Awards for Best Rock Album.


2.9 Co-Reference Resolution

§ Co-reference resolution seeks to connect expressions in a natural language text (e.g., pronouns) that refer to the same entity

§ Resolving co-references can be useful for IR systems (e.g., when generating result snippets) and other applications such as text summarization
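
To see why resolved co-references help with result snippets, consider the sketch below. It assumes the co-reference chain is already given (hand-annotated here, not resolver output) and that every mention is a single token; the data layout and the function `resolve` are hypothetical.

```python
# Hand-annotated co-reference chain for a two-sentence text (an assumption,
# not resolver output). Tokens are addressed by position.
tokens = ["Grohl", "founded", "Foo", "Fighters", ".", "He", "is", "the", "frontman", "."]

# Each chain holds a representative mention and the positions of the
# other (single-token) mentions that refer to the same entity.
chains = [{"representative": "Grohl", "mentions": [5]}]


def resolve(tokens, chains):
    """Replace each co-referring mention by its chain's representative mention."""
    resolved = list(tokens)
    for chain in chains:
        for position in chain["mentions"]:
            resolved[position] = chain["representative"]
    return resolved


# A snippet cut from the second sentence alone is clearer after resolution.
snippet = " ".join(resolve(tokens, chains)[5:9])
print(snippet)  # Grohl is the frontman
```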


Co-Reference Resolution

§ Example: Co-references determined by Stanford CoreNLP




2.10 Named Entity Recognition and Disambiguation

§ Named entity recognition spots expressions in natural language texts that refer to a named entity (e.g., person, organization, or location) and assigns a type to them

§ Recognizing named entities is a prerequisite for linking expressions in natural language texts to known named entities (e.g., in Wikipedia) and supports tasks such as information extraction
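
The input and output of named entity recognition can be illustrated with a naive dictionary lookup. This is a deliberately simple sketch, assuming a small hand-made gazetteer; toolkits such as Stanford CoreNLP rely on statistical models rather than a fixed list, and the names `GAZETTEER` and `spot_entities` are introduced here only for illustration.

```python
# Toy gazetteer mapping known names (as token tuples) to entity types;
# the entries are assumptions chosen to match the example texts.
GAZETTEER = {
    ("Foo", "Fighters"): "ORGANIZATION",
    ("Nirvana",): "ORGANIZATION",
    ("Liverpool",): "LOCATION",
    ("Pete", "Best"): "PERSON",
}


def spot_entities(tokens, gazetteer=GAZETTEER):
    """Greedy longest-match lookup; returns (start, end, text, type) tuples."""
    max_len = max(len(name) for name in gazetteer)
    found, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in gazetteer:
                found.append((i, i + length, " ".join(candidate), gazetteer[candidate]))
                i += length
                break
        else:
            i += 1  # no entity starts at this position
    return found


tokens = "Pete Best played clubs in Liverpool".split()
print(spot_entities(tokens))
# [(0, 2, 'Pete Best', 'PERSON'), (5, 6, 'Liverpool', 'LOCATION')]
```

A lookup like this cannot recognize unseen names or resolve ambiguous ones, which is why the disambiguation step described next is needed.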


Named Entity Recognition

§ Example: Named entities spotted by Stanford CoreNLP




Named Entity Disambiguation

§ Named entity disambiguation links mentions of named entities to articles in Wikipedia or entries in a knowledge graph

§ Spotting and disambiguating named entities in natural language texts allows for richer search functionalities and can support tasks such as information extraction
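
One simple way to picture disambiguation is to compare the words around a mention with keywords describing each candidate entity. The sketch below does exactly that; it is a strong simplification and not the method used by AIDA or any other tool named in this chapter. The candidate identifiers and keyword sets are illustrative assumptions.

```python
# Toy knowledge base: candidate entities per mention, each with a few
# descriptive keywords. Identifiers and keywords are illustrative assumptions.
CANDIDATES = {
    "Best": {
        "Pete_Best": {"drummer", "beatles", "liverpool", "band"},
        "George_Best": {"footballer", "manchester", "united", "winger"},
    },
}


def disambiguate(mention, context_tokens, candidates=CANDIDATES):
    """Pick the candidate whose keywords overlap most with the context words."""
    context = {token.lower() for token in context_tokens}
    options = candidates.get(mention, {})
    if not options:
        return None  # mention unknown to the knowledge base
    # Sorting makes ties deterministic (alphabetically first candidate wins).
    return max(sorted(options), key=lambda entity: len(options[entity] & context))


context = "The Beatles went through a succession of drummers including Best".split()
print(disambiguate("Best", context))  # Pete_Best
```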


Named Entity Disambiguation

§ Example: Named entities linked by AIDA (Ambiverse)


The Beatles built their reputation playing clubs in Liverpool and Hamburg over a three-year period from 1960, with Stuart Sutcliffe initially serving as bass player. The core trio of Lennon, McCartney and Harrison, together since 1958, went through a succession of drummers, including Pete Best, before asking Starr to join them in 1962.

Source: https://en.wikipedia.org/wiki/The_Beatles

Natural Language Processing Toolkits

§ Stanford CoreNLP (POS, dependencies, NER)
https://stanfordnlp.github.io/CoreNLP/
http://nlp.stanford.edu:8080/corenlp/

§ Natural Language Toolkit (POS, dependencies, NER)
http://www.nltk.org

§ AllenNLP
https://allennlp.org

Natural Language Processing Toolkits

§ AIDA (NED)
https://gate.d5.mpi-inf.mpg.de/webaida/
https://www.ambiverse.com/natural-language-understanding-api/

§ TagMe (NED)
https://tagme.d4science.org/tagme/

§ DBPedia Spotlight (NED)
https://www.dbpedia-spotlight.org/demo/

Summary

§ Lexical databases such as WordNet provide information about meanings of words and relations between words

§ Part-of-speech tagging labels words in natural language texts with their part of speech (e.g., noun or verb); dependency parsing establishes relations between head words and modifiers

§ Named entity recognition and disambiguation spot expressions in natural language texts that refer to named entities and link them, e.g., to their corresponding Wikipedia article



