Recap

§ Right document granularity is application-dependent

§ Tokenization splits documents into tokens, which are then further normalized to obtain terms as meaningful units that are kept in a dictionary and can be looked up

§ Stemming and lemmatization try to map different forms of the same word onto a canonical representation

§ Spelling mistakes by users are a problem and can be mitigated by trying to map to a known word

Information Retrieval / Chapter 2: Natural Language Preprocessing


2.6 Synonymy and Polysemy

§ Synonyms (i.e., different words having the same meaning) and polysemous words (i.e., words having multiple meanings) are a challenge for IR systems

§ Lexical databases provide descriptions of the different meanings of a word and information about relationships between different words

§ These can then be used in an IR system, for instance, to automatically expand queries by adding synonyms or adding related words for disambiguation
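
As a minimal sketch of synonym-based query expansion, the snippet below appends known synonyms to the query terms. The `SYNONYMS` table and the `expand_query` function are hypothetical and hand-written for illustration; a real system would look synonyms up in a lexical database rather than a hard-coded dictionary.

```python
# Hypothetical mini-thesaurus (assumption); entries mirror common synonym pairs.
SYNONYMS = {
    "car": ["automobile"],
    "holidays": ["vacation"],
}


def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the query terms followed by any synonyms not already present."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in synonyms.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


print(expand_query("cheap car holidays"))
# ['cheap', 'car', 'holidays', 'automobile', 'vacation']
```

Appending synonyms after the original terms keeps the user's wording first, which makes it easy to weight original terms higher than added ones if the ranking function supports that.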


Word Relations

§ Different relations can exist between two words, e.g.:

§ Synonymy (identical meaning) (e.g., car – automobile, holidays – vacation)

§ Antonymy (opposite meaning) (e.g., lucky – unlucky, expensive – cheap)

§ Hypernymy (more general) (e.g., mammal – rodent, machine – computer)

§ Hyponymy (more specific) (e.g., rat – rodent, rodent – mammal)

§ Meronymy (part-of relation) (e.g., tree – forest, board – computer)
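
Hypernymy and hyponymy are inverse relations and are usually treated as transitive, so a rat is also a kind of mammal. The sketch below follows hand-written "is a kind of" links built from the examples above; the `HYPERNYM_OF` table and both function names are assumptions made for illustration.

```python
# Hand-written "is a kind of" links taken from the examples above (assumption:
# each word has at most one direct hypernym in this toy table).
HYPERNYM_OF = {
    "rat": "rodent",
    "rodent": "mammal",
    "computer": "machine",
}


def hypernym_chain(word: str) -> list[str]:
    """Follow hypernym links upwards until no more general word is known."""
    chain = []
    seen = {word}
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        if word in seen:  # guard against accidental cycles in the toy data
            break
        seen.add(word)
        chain.append(word)
    return chain


def is_hyponym_of(specific: str, general: str) -> bool:
    """True if `general` is reachable from `specific` via hypernym links."""
    return general in hypernym_chain(specific)


print(hypernym_chain("rat"))           # ['rodent', 'mammal']
print(is_hyponym_of("rat", "mammal"))  # True
```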


WordNet

§ WordNet is a lexical database for English that has been manually curated for more than twenty years
https://wordnet.princeton.edu

§ WordNet groups words having the same meaning into a synset and provides a gloss as a short explanation of its meaning for each of them

§ In addition, it provides information about relations (e.g., antonymy, meronymy, hypernymy, hyponymy) between different synsets
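
To make the synset idea concrete, here is a small sketch of how synsets, glosses, and relations could be represented. The `Synset` class, the identifiers, and the glosses are hypothetical and written for illustration; they are not copied from WordNet and do not reflect its actual data format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    name: str                      # hypothetical identifier
    lemmas: list[str]              # words sharing this meaning
    gloss: str                     # short explanation of the meaning
    relations: dict[str, list[str]] = field(default_factory=dict)


# Toy entries (assumption): two senses of the polysemous word "car".
SYNSETS = [
    Synset("car.vehicle", ["car", "automobile"],
           "a motor vehicle with four wheels",
           {"hypernym": ["motor_vehicle"]}),
    Synset("car.railway", ["car", "railcar"],
           "a wheeled vehicle that runs on rails"),
]


def synsets_for(word: str) -> list[Synset]:
    """Return every synset that lists `word` among its lemmas."""
    return [s for s in SYNSETS if word in s.lemmas]


for synset in synsets_for("car"):
    print(synset.name, "-", synset.gloss)
```

A polysemous word such as car appears in several synsets, whereas synonyms such as car and automobile share one synset.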


WordNet

§ Example: Synsets for the word car


WordNet

§ Example: Direct hyponyms for the word car


GermaNet

§ GermaNet is a lexical database for German, which is similar in terms of functionality to WordNet
http://www.sfs.uni-tuebingen.de/GermaNet/
http://weblicht.sfs.uni-tuebingen.de/germanet/


GermaNet

§ Example: Synsets and relations for the word Wagen


Wiktionary

§ Wiktionary is a Wikimedia Foundation project aiming to collaboratively construct lexical databases for different languages
https://www.wiktionary.org


Wiktionary

§ Example: Available information for the word Wagen


Universal WordNet (UWN)

§ Universal WordNet (UWN) is a multilingual lexical database connecting words and synsets across different languages
http://www.lexvo.org/uwn/


Universal WordNet (UWN)

§ Example: Available information for the word car


2.7 Parts of Speech

§ Part-of-speech tagging adorns words in natural language texts with their part of speech (e.g., noun, verb, adverb)

§ Part-of-speech tags are essential for other tasks such as lemmatization and named entity recognition and downstream applications such as information extraction


POS Tags

§ The Penn Treebank project provides a commonly used list of part-of-speech tags, e.g.:

§ DT Determiner

§ JJ Adjective

§ JJR Adjective, comparative

§ NN Noun, singular or mass

§ NNS Noun, plural

§ NNP Proper noun, singular

§ NNPS Proper noun, plural

§ VB Verb, base form

§ Penn Treebank POS Tags
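
The tags listed above can be used, for example, to keep only the nouns of a tagged text. The following sketch assumes the tagger output is a list of (token, tag) pairs; the `TAGGED` sample was tagged by hand for illustration and is not the output of any particular tagger, and the helper names are hypothetical.

```python
# Subset of the Penn Treebank tag set listed above.
PENN_TAGS = {
    "DT": "Determiner",
    "JJ": "Adjective",
    "JJR": "Adjective, comparative",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "NNPS": "Proper noun, plural",
    "VB": "Verb, base form",
}

# Hand-assigned tags for illustration (assumption, not tagger output).
TAGGED = [
    ("the", "DT"), ("primary", "JJ"), ("songwriter", "NN"),
    ("Foo", "NNP"), ("Fighters", "NNPS"),
]


def nouns(tagged: list[tuple[str, str]]) -> list[str]:
    """Keep tokens whose tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [token for token, tag in tagged if tag.startswith("NN")]


def describe(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Replace each tag by its human-readable description."""
    return [(token, PENN_TAGS.get(tag, "unknown tag")) for token, tag in tagged]


print(nouns(TAGGED))  # ['songwriter', 'Foo', 'Fighters']
```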


Parts of Speech

§ Example: POS tags determined by Stanford CoreNLP

David Eric Grohl (born January 14, 1969) is an American musician, singer, songwriter, record producer, multi-instrumentalist and film director. He is the founder, frontman, lead vocalist, rhythm guitarist, lead guitarist, and primary songwriter of the rock band Foo Fighters since 1994, and was the longest-serving drummer for the rock band Nirvana from 1990 to 1994.

Source: https://en.wikipedia.org/wiki/Dave_Grohl


2.8 Dependencies

§ Dependency parsing identifies so-called head words in natural language texts and identifies relationships between those and other words that modify them

§ The Universal Dependencies Framework provides a collection of consistent annotations across languages
http://universaldependencies.org

§ Dependencies are essential for other tasks such as co-reference resolution and information extraction
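
A dependency parse can be thought of as a set of (head, relation, dependent) triples. The sketch below uses a hand-written parse of the short sentence "Foo Fighters won awards" with Universal Dependencies style relation labels; the parse, the `Dep` type, and the helper functions are assumptions made for illustration, not the output of a parser.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    head: str       # head word ("ROOT" for the artificial root node)
    relation: str   # relation label
    dependent: str  # word that modifies the head


# Hand-written parse (assumption). Real parsers identify words by token
# index rather than by surface form, since words can repeat in a sentence.
PARSE = [
    Dep("ROOT", "root", "won"),
    Dep("won", "nsubj", "Fighters"),
    Dep("Fighters", "compound", "Foo"),
    Dep("won", "obj", "awards"),
]


def root(parse: list[Dep]) -> str:
    """Return the head word of the whole sentence."""
    return next(d.dependent for d in parse if d.relation == "root")


def modifiers_of(head: str, parse: list[Dep]) -> list[tuple[str, str]]:
    """Return (relation, dependent) pairs for all words modifying `head`."""
    return [(d.relation, d.dependent) for d in parse if d.head == head]


print(root(PARSE))                 # won
print(modifiers_of("won", PARSE))  # [('nsubj', 'Fighters'), ('obj', 'awards')]
```

Reading off the subject and object of the root verb in this way is a typical first step towards information extraction.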


Dependencies

§ Example: Dependencies determined by Stanford CoreNLP

Foo Fighters have found worldwide success winning multiple awards, most notably with four of its albums winning Grammy Awards for Best Rock Album.

Source: https://en.wikipedia.org/wiki/Dave_Grohl


2.9 Co-Reference Resolution

§ Co-reference resolution seeks to connect expressions in a natural language text (e.g., pronouns) that refer to the same entity

§ Resolving co-references can be useful for IR systems (e.g., when generating result snippets) and other applications such as text summarization
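
To illustrate why resolved co-references help with result snippets, the sketch below replaces later mentions in a co-reference chain by the first mention, so that a snippet cut from the second sentence still names the entity. The chain is given by hand as character offsets; the `resolve` function and the sample data are hypothetical, and finding the chain automatically is the actual co-reference resolution task, which is not shown here.

```python
def resolve(text: str, chain: list[tuple[int, int]]) -> str:
    """Replace every later mention in `chain` by the first (representative) one.

    `chain` holds (start, end) character offsets of mentions of one entity.
    """
    first_start, first_end = chain[0]
    representative = text[first_start:first_end]
    # Work from right to left so that earlier offsets remain valid.
    for start, end in sorted(chain[1:], reverse=True):
        text = text[:start] + representative + text[end:]
    return text


# Hand-specified chain (assumption): "Dave Grohl" and "He" co-refer.
TEXT = "Dave Grohl is a musician. He founded Foo Fighters."
CHAIN = [(0, 10), (26, 28)]

print(resolve(TEXT, CHAIN))
# Dave Grohl is a musician. Dave Grohl founded Foo Fighters.
```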


Co-Reference Resolution

§ Example: Co-references determined by Stanford CoreNLP

David Eric Grohl (born January 14, 1969) is an American musician, singer, songwriter, record producer, multi-instrumentalist and film director. He is the founder, frontman, lead vocalist, rhythm guitarist, lead guitarist, and primary songwriter of the rock band Foo Fighters since 1994, and was the longest-serving drummer for the rock band Nirvana from 1990 to 1994.

Source: https://en.wikipedia.org/wiki/Dave_Grohl


2.10 Named Entity Recognition and Disambiguation

§ Named entity recognition spots expressions in natural language texts that refer to a named entity (e.g., person, organization, or location) and assigns a type to them

§ Recognizing named entities is a prerequisite for linking expressions in natural language texts to known named entities (e.g., in Wikipedia) and supports tasks such as information extraction
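
The input and output of named entity recognition can be illustrated with a simple dictionary lookup. Note that this is a deliberately simplified stand-in: the sketch matches names from a hand-written gazetteer, whereas toolkits typically use trained statistical models. The `GAZETTEER` entries, the sample sentence, and the function name are assumptions made for illustration.

```python
import re

# Hand-written gazetteer (assumption): surface form -> entity type.
GAZETTEER = {
    "David Eric Grohl": "PERSON",
    "Foo Fighters": "ORGANIZATION",
    "Nirvana": "ORGANIZATION",
}


def spot_entities(text: str) -> list[tuple[str, str]]:
    """Return (mention, type) pairs in order of appearance in `text`."""
    mentions = []
    for name, entity_type in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            mentions.append((match.start(), match.group(), entity_type))
    return [(surface, entity_type) for _, surface, entity_type in sorted(mentions)]


sentence = "David Eric Grohl founded Foo Fighters and was the drummer for Nirvana."
print(spot_entities(sentence))
# [('David Eric Grohl', 'PERSON'), ('Foo Fighters', 'ORGANIZATION'), ('Nirvana', 'ORGANIZATION')]
```

A lookup like this cannot recognize names missing from the dictionary and cannot tell apart different uses of the same string, which is where learned models and disambiguation come in.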


Named Entity Recognition

§ Example: Named entities spotted by Stanford CoreNLP

David Eric Grohl (born January 14, 1969) is an American musician, singer, songwriter, record producer, multi-instrumentalist and film director. He is the founder, frontman, lead vocalist, rhythm guitarist, lead guitarist, and primary songwriter of the rock band Foo Fighters since 1994, and was the longest-serving drummer for the rock band Nirvana from 1990 to 1994.

Source: https://en.wikipedia.org/wiki/Dave_Grohl


Named Entity Disambiguation

§ Named entity disambiguation links mentions of named entities to articles in Wikipedia or entries in a knowledge graph

§ Spotting and disambiguating named entities in natural language texts allows for richer search functionalities and can support tasks such as information extraction


Named Entity Disambiguation

§ Example: Named entities linked by AIDA (Ambiverse)

The Beatles built their reputation playing clubs in Liverpool and Hamburg over a three-year period from 1960, with Stuart Sutcliffe initially serving as bass player. The core trio of Lennon, McCartney and Harrison, together since 1958, went through a succession of drummers, including Pete Best, before asking Starr to join them in 1962.

Source: https://en.wikipedia.org/wiki/The_Beatles


Natural Language Processing Toolkits

§ Stanford CoreNLP (POS, dependencies, NER)
https://stanfordnlp.github.io/CoreNLP/
http://nlp.stanford.edu:8080/corenlp/

§ Natural Language Toolkit (POS, dependencies, NER)
http://www.nltk.org

§ AllenNLP
https://allennlp.org


Natural Language Processing Toolkits

§ AIDA (NED)
https://gate.d5.mpi-inf.mpg.de/webaida/
https://www.ambiverse.com/nl-api/

§ TagMe (NED)
https://tagme.d4science.org/tagme/

§ DBpedia Spotlight (NED)
https://www.dbpedia-spotlight.org/demo/


Summary

§ Lexical databases such as WordNet provide information about meanings of words and relations between words

§ Part-of-speech tagging labels words in natural language texts with their part of speech (e.g., noun or verb); dependency parsing establishes relations between head words and modifiers

§ Named entity recognition and disambiguation spot expressions in natural language texts that refer to named entities and link them, e.g., to their corresponding Wikipedia article


Literature

[1] C. D. Manning, P. Raghavan, and H. Schütze: Introduction to Information Retrieval, Cambridge University Press, 2008 (Chapter 1)

[2] W. B. Croft, D. Metzler, and T. Strohman: Search Engines – Information Retrieval in Practice, Pearson Education, 2009 (Chapter 1)
