TRANSCRIPT
37
Recap
§ The right document granularity is application-dependent
§ Tokenization splits documents into tokens, which are then further normalized to obtain terms as meaningful units that are kept in a dictionary and can be looked up
§ Stemming and lemmatization try to map different forms of the same word onto a canonical representation
§ Spelling mistakes by users are a problem and can be mitigated by trying to map to a known word
Information Retrieval / Chapter 2: Natural Language Preprocessing
38
2.6 Synonymy and Polysemy
§ Synonyms (i.e., different words having the same meaning) and polysemes (i.e., words having multiple meanings) are a challenge for IR systems
§ Lexical databases provide descriptions of the different meanings of a word and information about relationships between different words
§ These can then be used in an IR system, for instance, to automatically expand queries by adding synonyms or adding related words for disambiguation
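
The query expansion idea above can be sketched as follows. This is a minimal illustration, assuming a tiny hand-written synonym table in place of a real lexical database; the names SYNONYMS and expand_query are hypothetical and not part of any library.

```python
# Toy synonym table standing in for a lexical database (assumption:
# hand-written entries, chosen only for illustration).
SYNONYMS = {
    "car": ["automobile"],
    "holidays": ["vacation"],
}


def expand_query(query: str) -> list[str]:
    """Return the query terms followed by any synonyms not already present."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


print(expand_query("cheap car holidays"))
# ['cheap', 'car', 'holidays', 'automobile', 'vacation']
```

Appending the synonyms after the original terms keeps the user's own wording first, which makes it easy to weight original terms higher than added ones if desired.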
39
Word Relations
§ Different relations can exist between two words, e.g.:
  § Synonymy (identical meaning), e.g., car – automobile, holidays – vacation
  § Antonymy (opposite meaning), e.g., lucky – unlucky, expensive – cheap
  § Hypernymy (more general), e.g., mammal – rodent, machine – computer
  § Hyponymy (more specific), e.g., rat – rodent, rodent – mammal
  § Meronymy (part-of relation), e.g., tree – forest, board – computer
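
One simple way to store such relations is as typed triples. The sketch below is an assumption-laden illustration (RELATIONS, hypernyms_of, and hyponyms_of are hypothetical names, and the triples are taken from the examples above); it shows that hyponymy is hypernymy read in the opposite direction.

```python
# Each triple reads (word_a, relation, word_b). For "hypernym", word_a is
# the more general word; for "meronym", word_a is the part.
RELATIONS = [
    ("car", "synonym", "automobile"),
    ("expensive", "antonym", "cheap"),
    ("mammal", "hypernym", "rodent"),
    ("rodent", "hypernym", "rat"),
    ("tree", "meronym", "forest"),
]


def hypernyms_of(word: str) -> list[str]:
    """Follow hypernym links upwards, from the closest to the most general."""
    chain = []
    current = word
    while True:
        parents = [
            a for a, rel, b in RELATIONS
            if rel == "hypernym" and b == current
        ]
        if not parents:
            return chain
        current = parents[0]  # toy data: at most one parent per word
        chain.append(current)


def hyponyms_of(word: str) -> list[str]:
    """Direct hyponyms: the same hypernym triples, read in reverse."""
    return [b for a, rel, b in RELATIONS if rel == "hypernym" and a == word]


print(hypernyms_of("rat"))     # ['rodent', 'mammal']
print(hyponyms_of("mammal"))   # ['rodent']
```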
40
WordNet
§ WordNet is a lexical database for English that has been manually curated for more than twenty years
  https://wordnet.princeton.edu
§ WordNet groups words having the same meaning into a synset and provides a gloss as a short explanation of its meaning for each of them
§ In addition, it provides information about relations (e.g., antonymy, meronymy, hypernymy, hyponymy) between different synsets
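
To make the notions of synset and gloss concrete, here is a minimal sketch of how they could be modelled. The entries and glosses are hand-written for illustration and are not copied from WordNet; Synset, SYNSETS, and synsets_for are hypothetical names, not the WordNet API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    words: tuple[str, ...]  # words sharing this meaning
    gloss: str              # short explanation of the meaning


# Hand-written toy entries (assumption: illustrative only, not WordNet data).
SYNSETS = [
    Synset(("car", "automobile"), "a motor vehicle with four wheels"),
    Synset(("car", "railcar"), "a wheeled vehicle that runs on rails"),
    Synset(("holidays", "vacation"), "a period of leisure away from work"),
]


def synsets_for(word: str) -> list[Synset]:
    """Return every synset (i.e., every meaning) that contains the word."""
    return [s for s in SYNSETS if word in s.words]


for synset in synsets_for("car"):
    print(", ".join(synset.words), "-", synset.gloss)
```

A word that appears in several synsets, such as car here, is polysemous; the words inside one synset are synonyms of each other.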
41
WordNet
§ Example: Synsets for the word car
42
WordNet
§ Example: Direct hyponyms for the word car
43
GermaNet
§ GermaNet is a lexical database for German, which is similar in terms of functionality to WordNet
  http://www.sfs.uni-tuebingen.de/GermaNet/
  http://weblicht.sfs.uni-tuebingen.de/germanet/
44
GermaNet
§ Example: Synsets and relations for the word Wagen
45
Wiktionary
§ Wiktionary is a Wikimedia Foundation project aiming to collaboratively construct lexical databases for different languages
  https://www.wiktionary.org
46
Wiktionary
§ Example: Available information for the word Wagen
47
Universal WordNet (UWN)
§ Universal WordNet (UWN) is a multilingual lexical database connecting words and synsets across different languages
  http://www.lexvo.org/uwn/
48
Universal WordNet (UWN)
§ Example: Available information for the word car
49
2.7 Parts of Speech
§ Part-of-speech tagging labels words in natural language texts with their part of speech (e.g., noun, verb, adverb)
§ Part-of-speech tags are essential for other tasks such as lemmatization and named entity recognition, and for downstream applications such as information extraction
50
POS Tags
§ The Penn Treebank project provides a commonly used list of part-of-speech tags, e.g.:
  § DT    Determiner
  § JJ    Adjective
  § JJR   Adjective, comparative
  § NN    Noun, singular or mass
  § NNS   Noun, plural
  § NNP   Proper noun, singular
  § NNPS  Proper noun, plural
  § VB    Verb, base form
§ Penn Treebank POS Tags
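
Because related Penn Treebank tags share a prefix (all noun tags start with NN), tagged text can be filtered with simple string operations. The following sketch assumes hand-assigned tags for a short sentence, not the output of any tagger; PENN_TAGS, TAGGED, and words_with_prefix are hypothetical names. VBZ (verb, 3rd person singular present) is a Penn Treebank tag beyond the excerpt listed above.

```python
# Tags from the list above, plus VBZ, which the example sentence needs.
PENN_TAGS = {
    "DT": "Determiner",
    "JJ": "Adjective",
    "JJR": "Adjective, comparative",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "NNPS": "Proper noun, plural",
    "VB": "Verb, base form",
    "VBZ": "Verb, 3rd person singular present",
}

# Hand-assigned tags (assumption: not produced by a tagger).
TAGGED = [
    ("David", "NNP"), ("Eric", "NNP"), ("Grohl", "NNP"), ("is", "VBZ"),
    ("an", "DT"), ("American", "JJ"), ("musician", "NN"),
]


def words_with_prefix(tagged, prefix):
    """Select words whose tag starts with the prefix, e.g. 'NN' for all nouns."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


print(words_with_prefix(TAGGED, "NN"))   # ['David', 'Eric', 'Grohl', 'musician']
print(words_with_prefix(TAGGED, "NNP"))  # ['David', 'Eric', 'Grohl']
```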
51
Parts of Speech
§ Example: POS tags determined by Stanford CoreNLP
David Eric Grohl (born January 14, 1969) is an American musician, singer, songwriter, record producer, multi-instrumentalist and film director. He is the founder, frontman, lead vocalist, rhythm guitarist, lead guitarist, and primary songwriter of the rock band Foo Fighters since 1994, and was the longest-serving drummer for the rock band Nirvana from 1990 to 1994.
Source: https://en.wikipedia.org/wiki/Dave_Grohl
52
2.8 Dependencies
§ Dependency parsing identifies so-called head words in natural language texts and identifies relationships between those and other words that modify them
§ The Universal Dependencies Framework provides a collection of consistent annotations across languages
  http://universaldependencies.org
§ Dependencies are essential for other tasks such as co-reference resolution and information extraction
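
A dependency parse can be represented as a list of (head, relation, dependent) arcs. The sketch below is annotated by hand in the style of Universal Dependencies for the clause "Foo Fighters have found worldwide success"; it is an assumption for illustration, not parser output, and ARCS, root, and dependents are hypothetical names.

```python
# Hand-written arcs (head, relation, dependent); assumption: labels follow
# Universal Dependencies conventions but were not produced by a parser.
ARCS = [
    ("ROOT", "root", "found"),
    ("found", "nsubj", "Fighters"),
    ("Fighters", "compound", "Foo"),
    ("found", "aux", "have"),
    ("found", "obj", "success"),
    ("success", "amod", "worldwide"),
]


def root() -> str:
    """Return the head word of the whole clause."""
    return next(dep for head, rel, dep in ARCS if rel == "root")


def dependents(head: str) -> dict[str, str]:
    """Map each word that directly modifies `head` to its relation label."""
    return {dep: rel for h, rel, dep in ARCS if h == head}


print(root())                 # found
print(dependents("found"))    # {'Fighters': 'nsubj', 'have': 'aux', 'success': 'obj'}
print(dependents("success"))  # {'worldwide': 'amod'}
```

Reading subject and object directly off the arcs of the root verb is the kind of lookup that information extraction builds on.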
53
Dependencies
§ Example: Dependencies determined by Stanford CoreNLP
Foo Fighters have found worldwide success winning multiple awards, most notably with four of its albums winning Grammy Awards for Best Rock Album.
Source: https://en.wikipedia.org/wiki/Dave_Grohl
54
2.9 Co-Reference Resolution
§ Co-reference resolution seeks to connect expressions in a natural language text (e.g., pronouns) that refer to the same entity
§ Resolving co-references can be useful for IR systems (e.g., when generating result snippets) and other applications such as text summarization
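
As a rough intuition for the task, the sketch below links each pronoun to the nearest preceding entity mention. This is a deliberately naive baseline under stated assumptions (mentions are given as input, gender and number agreement are ignored) and is not how CoreNLP or other real systems resolve co-references; PRONOUNS and resolve_pronouns are hypothetical names.

```python
PRONOUNS = {"he", "she", "him", "her", "his", "it", "they", "them"}


def resolve_pronouns(tokens: list[str], mentions: dict[int, str]) -> dict[int, str]:
    """Map each pronoun position to the nearest preceding entity mention.

    `mentions` maps token positions to entity names and is given as input
    here; spotting the mentions is a separate task.
    """
    links = {}
    last_entity = None
    for position, token in enumerate(tokens):
        if position in mentions:
            last_entity = mentions[position]
        elif token.lower() in PRONOUNS and last_entity is not None:
            links[position] = last_entity
    return links


tokens = "Grohl is a musician . He founded Foo Fighters .".split()
mentions = {0: "Dave Grohl", 7: "Foo Fighters"}
print(resolve_pronouns(tokens, mentions))  # {5: 'Dave Grohl'}
```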
55
Co-Reference Resolution
§ Example: Co-references determined by Stanford CoreNLP
David Eric Grohl (born January 14, 1969) is an American musician, singer, songwriter, record producer, multi-instrumentalist and film director. He is the founder, frontman, lead vocalist, rhythm guitarist, lead guitarist, and primary songwriter of the rock band Foo Fighters since 1994, and was the longest-serving drummer for the rock band Nirvana from 1990 to 1994.
Source: https://en.wikipedia.org/wiki/Dave_Grohl
56
2.10 Named Entity Recognition and Disambiguation
§ Named entity recognition spots expressions in natural language texts that refer to a named entity (e.g., person, organization, or location) and assigns a type to them
§ Recognizing named entities is a prerequisite for linking expressions in natural language texts to known named entities (e.g., in Wikipedia) and supports tasks such as information extraction
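
The simplest form of entity spotting is a dictionary (gazetteer) lookup, sketched below. This is an illustrative stand-in under stated assumptions, not the statistical approach used by toolkits such as CoreNLP; the gazetteer entries are hand-written, and GAZETTEER and spot_entities are hypothetical names.

```python
# Toy gazetteer mapping known names (as token tuples) to entity types
# (assumption: hand-written entries for illustration).
GAZETTEER = {
    ("David", "Eric", "Grohl"): "PERSON",
    ("Foo", "Fighters"): "ORGANIZATION",
    ("Nirvana",): "ORGANIZATION",
    ("Liverpool",): "LOCATION",
}
MAX_LEN = max(len(name) for name in GAZETTEER)


def spot_entities(tokens: list[str]) -> list[tuple[str, str]]:
    """Greedy longest-match lookup of token spans in the gazetteer."""
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + length])
            if span in GAZETTEER:
                found.append((" ".join(span), GAZETTEER[span]))
                i += length
                break
        else:
            i += 1  # no entity starts at this token
    return found


print(spot_entities("David Eric Grohl founded Foo Fighters after Nirvana".split()))
```

A lookup like this cannot recognize names missing from the dictionary, which is one reason practical systems learn to recognize entities from context instead.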
57
Named Entity Recognition
§ Example: Named entities spotted by Stanford CoreNLP
David Eric Grohl (born January 14, 1969) is an American musician, singer, songwriter, record producer, multi-instrumentalist and film director. He is the founder, frontman, lead vocalist, rhythm guitarist, lead guitarist, and primary songwriter of the rock band Foo Fighters since 1994, and was the longest-serving drummer for the rock band Nirvana from 1990 to 1994.
Source: https://en.wikipedia.org/wiki/Dave_Grohl
58
Named Entity Disambiguation
§ Named entity disambiguation links mentions of named entities to articles in Wikipedia or entries in a knowledge graph
§ Spotting and disambiguating named entities in natural language texts allows for richer search functionalities and can support tasks such as information extraction
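
One intuition behind disambiguation is to compare the words around a mention with what is known about each candidate entity. The sketch below picks the candidate with the largest keyword overlap; it is a toy heuristic under stated assumptions and not the method used by AIDA or any other system. The candidate list and keywords are hand-picked, and CANDIDATES and disambiguate are hypothetical names.

```python
from typing import Optional

# Hypothetical mini knowledge base: candidate entities per mention, each
# with hand-picked context keywords (assumption: illustrative data only).
CANDIDATES = {
    "Harrison": {
        "George_Harrison": {"beatles", "guitarist", "liverpool", "lennon"},
        "Harrison_Ford": {"actor", "film", "hollywood"},
    },
}


def disambiguate(mention: str, context: str) -> Optional[str]:
    """Pick the candidate whose keywords overlap most with the context."""
    candidates = CANDIDATES.get(mention)
    if not candidates:
        return None  # unknown mention: nothing to link to
    context_words = set(context.lower().replace(",", " ").split())
    return max(
        candidates,
        key=lambda name: len(candidates[name] & context_words),
    )


sentence = "The core trio of Lennon, McCartney and Harrison played clubs in Liverpool"
print(disambiguate("Harrison", sentence))  # George_Harrison
```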
59
Named Entity Disambiguation
§ Example: Named entities linked by AIDA (Ambiverse)
The Beatles built their reputation playing clubs in Liverpool and Hamburg over a three-year period from 1960, with Stuart Sutcliffe initially serving as bass player. The core trio of Lennon, McCartney and Harrison, together since 1958, went through a succession of drummers, including Pete Best, before asking Starr to join them in 1962.
Source: https://en.wikipedia.org/wiki/The_Beatles
60
Natural Language Processing Toolkits
§ Stanford CoreNLP (POS, dependencies, NER)
  https://stanfordnlp.github.io/CoreNLP/
  http://nlp.stanford.edu:8080/corenlp/
§ Natural Language Toolkit (POS, dependencies, NER)
  http://www.nltk.org
§ AllenNLP
  https://allennlp.org
61
Natural Language Processing Toolkits
§ AIDA (NED)
  https://gate.d5.mpi-inf.mpg.de/webaida/
  https://www.ambiverse.com/nl-api/
§ TagMe (NED)
  https://tagme.d4science.org/tagme/
§ DBpedia Spotlight (NED)
  https://www.dbpedia-spotlight.org/demo/
62
Summary
§ Lexical databases such as WordNet provide information about meanings of words and relations between words
§ Part-of-speech tagging labels words in natural language texts with their part of speech (e.g., noun or verb); dependency parsing establishes relations between head words and modifiers
§ Named entity recognition and disambiguation spot expressions in natural language texts that refer to named entities and link them, e.g., to their corresponding Wikipedia article
63
Literature
[1] C. D. Manning, P. Raghavan, and H. Schütze: Introduction to Information Retrieval, Cambridge University Press, 2008 (Chapter 1)
[2] W. B. Croft, D. Metzler, and T. Strohman: Search Engines – Information Retrieval in Practice, Pearson Education, 2009 (Chapter 1)