I256 Applied Natural Language Processing Fall 2009 Lecture 2 Python Related fields Linguistic essentials Barbara Rosario


I256

Applied Natural Language Processing

Fall 2009

Lecture 2

• Python
• Related fields
• Linguistic essentials

Barbara Rosario


Today

• Announcements
  – I admitted all the students on the waiting list. Tele-Bears should reflect the change by today.
  – Any questions/concerns about the class?
  – Homework due next Tuesday, September 8, at 12:30
    • Make sure you are all set to start with Python & NLTK
  – Office hours (Room 6)
    • Today: Gopal at 2
    • Wednesday 3-4: Gopal (if there is a request, let him know)
    • Thursday: Barbara at 2
  – Some (light) readings for Thursday
• Python
• Related fields
• Linguistic essentials


Python - Simple yet powerful

The Zen of Python: http://www.python.org/dev/peps/pep-0020/

• Very clear, readable syntax
• Strong introspection capabilities
  – http://www.ibm.com/developerworks/library/l-pyint.html (recommended)
• Intuitive object orientation
• Natural expression of procedural code
• Full modularity, supporting hierarchical packages
• Exception-based error handling
• Very high level dynamic data types
• Extensive standard libraries and third party modules for virtually every task
  – Excellent functionality for processing linguistic data.
  – NLTK is one such extensive third party module.

Source : python.org

Python


• Numeric types
  – plain integers: implemented as long in C, at least 32 bits of precision (try: sys.maxint)
  – long integers: unlimited precision
  – floating point numbers
  – complex numbers
• Sequences
  – Strings (immutable)
  – Lists (mutable)
  – Tuples (immutable)
• Mappings
  – Dictionary

• File objects

• Classes

• Instances

• Exceptions

Source : python.org

Python (built-in types)
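
The built-in types listed above can be poked at interactively. The following is only a minimal sketch and it assumes a Python 3 interpreter, which is an assumption on our part: the slide describes Python 2, where plain and long integers were separate types and sys.maxint existed. In Python 3 there is a single integer type with unlimited precision, and sys.maxsize is the closest relative of sys.maxint.

```python
import sys

# Python 3 has one integer type with unlimited precision.
big = 2 ** 100
print(type(big).__name__, big)

# sys.maxsize bounds container sizes; it is not a limit on int values.
print(big > sys.maxsize)

# Floating point and complex numbers are separate numeric types.
x = 3.5
z = complex(1, 2)
print(type(x).__name__, type(z).__name__, z * z)

# One sample each of a string, a list, a tuple, and a dictionary.
samples = ["nlp", ["a", "b"], ("a", "b"), {"nlp": 42577}]
type_names = [type(s).__name__ for s in samples]
print(type_names)
```

Calling type() on a value is a quick way to check which of these categories it belongs to.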


LISTS
• More than an 'array'.
• Hold arbitrary objects and expand/collapse dynamically.

Source : python.org

Python (Lists and tuples)

Define using standard array-like syntax:

>>> mylist = ['nlp', 42577, 256, 'applied_nlp']
>>> mylist[3]
'applied_nlp'
>>> mylist[-1]
'applied_nlp'
>>> mylist[1:3]
[42577, 256]

A few methods (list li):

• len(li)
• li.append('something')
• li.extend([list])
• li.insert(index, 'value')
• li.index("nlp")
• li.remove("nlp")
• li = li + [list]
• ...

TUPLE
• A tuple is an immutable list. It cannot be changed once created.

>>> mytuple = ('nlp', 42577, 256, 'applied_nlp')
>>> mytuple[3]
'applied_nlp'
>>> mytuple[3] = 'blahblah'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
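
To see the list methods and the tuple restriction side by side, here is a small sketch written as a script rather than an interpreter session. It assumes Python 3, and the values added to the list are made up for illustration.

```python
# Lists are mutable: each call below changes mylist in place.
mylist = ["nlp", 42577, 256, "applied_nlp"]
mylist.append("something")      # add one element at the end
mylist.extend([1, 2])           # add several elements at the end
mylist.insert(0, "first")       # insert before index 0
mylist.remove("nlp")            # delete the first matching value
position = mylist.index("applied_nlp")

# Tuples are immutable: item assignment raises TypeError.
mytuple = ("nlp", 42577, 256, "applied_nlp")
try:
    mytuple[3] = "blahblah"
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(mylist)
print(position, assignment_failed)
```

Catching the TypeError is only there to keep the script running; in practice the exception is the signal that a list was the right type to use.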


• Provides many string manipulation methods

• Strings can be subscripted (indexed)
  – Can use some list-style methods

• String formatting (the % operator)

Source : python.org

Python (Strings)

A few methods (string str):

• len(str)
• str.capitalize()
• str.count(sub[, start[, end]])
• str.find(sub[, start[, end]])
• str.replace(old, new[, count])
• str.strip([chars])
• str.split([sep[, maxsplit]])
• ...

>>> mystring = "jolly good"
>>> mystring[1:5]
'olly'

>>> print("this is a %s course" % "NLP")
this is a NLP course
>>> print("this is a %s course in fall%d" % ("NLP", 9))
this is a NLP course in fall9
>>> print("this is a %(course)s course" % {'course': "NLP"})
this is a NLP course

>>> print("uc" + "berkeley")
ucberkeley
>>> li = ['a', 'b', 'c', 'd']
>>> s = ";".join(li)
>>> s
'a;b;c;d'
>>> s.split(";")
['a', 'b', 'c', 'd']
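
The string methods listed on this slide can be chained into a small cleanup routine. This is a sketch only, assuming Python 3; the sample text is invented.

```python
text = "  jolly good show  "

clean = text.strip()                       # drop surrounding whitespace
words = clean.split()                      # split on runs of whitespace
joined = ";".join(words)                   # inverse of split(";")
o_count = clean.count("o")                 # non-overlapping occurrences
found = clean.find("good")                 # index of first match
missing = clean.find("bad")                # -1 when there is no match
replaced = clean.replace("good", "fine")   # strings are immutable: new object
title = clean.capitalize()                 # first character upper-cased

# The % operator fills the placeholders from a tuple, in order.
message = "this is a %s course in fall %d" % ("NLP", 2009)

print(words, joined, o_count, found, missing)
print(replaced, "|", title, "|", message)
```

None of these calls modifies the original string; each one returns a new value.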


• A mapping object maps hashable values to arbitrary objects.
• Mappings are mutable objects.
• There is currently only one standard mapping type, the dictionary.

• Creating dictionaries

Source : python.org

Python (Mapping objects)

A comma-separated list of key: value pairs within braces:

>>> mydict = {'nlp': 42577, 256: 'applied_nlp'}
>>> mydict[256]
'applied_nlp'

Using the constructor of the built-in dict class (each call below builds the same dictionary, {'one': 2, 'two': 3}):

dict(one=2, two=3)
dict({'one': 2, 'two': 3})
dict(zip(('one', 'two'), (2, 3)))
dict([['two', 3], ['one', 2]])

A few methods (dictionary d):

• len(d)
• d[key]
• d[key] = value
• del d[key]
• key in d
• d.clear()
• d.copy()
• d.get(key[, default])
• d.items()
• d.iteritems()
• ...
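
A common use of dictionaries in language processing is counting words. The sketch below assumes Python 3 (where iteritems() no longer exists and items() takes its place) and uses an invented sentence.

```python
tokens = "the cat sleeps in the box".split()

# d.get(key, default) avoids a KeyError the first time a word is seen.
counts = {}
for token in tokens:
    counts[token] = counts.get(token, 0) + 1

# The constructor forms from the slide all build equal dictionaries.
a = dict(one=2, two=3)
b = dict(zip(("one", "two"), (2, 3)))
c = dict([["two", 3], ["one", 2]])
all_equal = (a == b == c)

del counts["in"]                 # remove one key
has_cat = "cat" in counts        # membership tests look at keys

for word, freq in sorted(counts.items()):
    print(word, freq)
print(all_equal, has_cat, len(counts))
```

Sorting the items before printing gives a stable display order regardless of insertion order.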


Submission for Assignment 1

For Assignment 1 (see also the web site):

• Create a file LastNameFirstName_assignment1.py
• This is the main file where all your code will reside.
• We will evaluate each question/sub-question by running:

$ python LastNameFirstName_assignment1.py question1
$ python LastNameFirstName_assignment1.py question1.1

• Add logic to your code based on the command line argument (process your command line argument string) and output accordingly. Command line arguments in Python are accessed through the sys.argv list. You can also use the getopt module.
• Make sure you include this header information at the beginning of your code:

#! /usr/bin/env python
#author: 'Your name'
#email = 'your email address'
#python_version = 'python version you are using'

For questions on the homework, please email [email protected]

Email your assignment to [email protected] and [email protected]
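
One possible way to organize the command line handling described above is a dictionary that maps the argument string to a function. This is only a sketch: the function names, the HANDLERS table, and the returned strings are hypothetical placeholders, not part of the assignment specification.

```python
#! /usr/bin/env python
import sys


def question1():
    # Hypothetical placeholder for the real answer to question 1.
    return "answer to question 1"


def question1_1():
    # Hypothetical placeholder for the real answer to question 1.1.
    return "answer to question 1.1"


# Hypothetical table: command line argument -> function to call.
HANDLERS = {"question1": question1, "question1.1": question1_1}


def run(argv):
    """Dispatch on argv[1]; argv[0] is the script name."""
    if len(argv) != 2 or argv[1] not in HANDLERS:
        return "usage: %s [%s]" % (argv[0], "|".join(sorted(HANDLERS)))
    return HANDLERS[argv[1]]()


if __name__ == "__main__":
    print(run(sys.argv))
```

Keeping the dispatch in a function that takes argv as a parameter makes it easy to try out, for example run(["prog", "question1.1"]), without touching the real command line.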


Related Fields

• NLP
• Linguistics
  – All about languages
• Computational Linguistics
  – Using computational methods to learn more about how language works
• Speech Recognition
  – Mapping audio signals to text
  – Two components: acoustic models and language models
  – Language models belong to the domain of statistical NLP
• Cognitive Science
  – Figuring out how the human brain works, including language


Linguistics essentials

• Important distinction:
  – study of language structure (grammar)
  – study of meaning (semantics)
• Grammar
  – Phonology (the study of sound systems and abstract sound units)
  – Morphology (the formation and composition of words)
  – Syntax (the rules that determine how words combine into sentences)
• Semantics
  – The study of the meaning of words (lexical semantics) and fixed word combinations (phraseology), and how these combine to form the meanings of sentences

http://en.wikipedia.org/wiki/Linguistics


Linguistics sub-fields

• Discourse analysis
  – concerned with the structure of texts and conversations
• Pragmatics
  – concerned with how meaning is transmitted based on a combination of linguistic competence, non-linguistic knowledge, and the context of the speech act


Linguistics sub-fields

• Evolutionary linguistics
  – origins of language
• Historical linguistics
  – explores language change
• Sociolinguistics
  – looks at the relation between linguistic variation and social structures
• Psycholinguistics
  – explores the representation and functioning of language in the mind
• Neurolinguistics
  – looks at the representation of language in the brain
• Language acquisition
  – how children acquire their first language and how children and adults acquire and learn their second and subsequent languages
• And others:
  – for an overview see http://en.wikipedia.org/wiki/Linguistics

Adapted from http://en.wikipedia.org/wiki/Linguistics


Linguistics essentials

• This course:

• Some grammar

• Mostly “semantics”


Grammar: words

• Words of a language are grouped into classes to reflect similar syntactic behaviors

• Syntactic or grammatical categories (aka parts of speech)
  – Nouns (people, animals, concepts)
  – Verbs (actions, states)
  – Adjectives
  – Prepositions
  – Determiners
• Open or lexical categories (nouns, verbs, adjectives)
  – Large number of members, new words are commonly added
• Closed or functional categories (prepositions, determiners)
  – Few members, clear grammatical use


Grammar: words

• Word categories are related by morphological processes
  – -s for plural nouns
  – -ed for verbs' past forms
  – Next class
  – Why important for NLP?
  – More important for some languages
    • English regular verbs have 4 forms (at most 8 in irregular verbs)
    • Finnish verbs have 10,000 forms


Grammatical categories

• Nouns typically refer to entities in the world like people, animals, things, ideas...
• Types of inflection
  – Number
  – Gender
  – Case (nominative, genitive, accusative, dative)
• Pronouns: variables to refer to an entity previously mentioned


Grammatical categories: Verbs

• Usually denote an action (bring, read), an occurrence (decompose, glitter), or a state of being (exist, stand).

• Depending on the language, a verb may vary in form according to many factors, possibly including its tense, aspect, mood and voice.

• It may also agree with the person, gender, and/or number of some of its arguments (subject, object, etc.)


Verbs’ factors

• Tense: time of the action
  – Present, past, future
• Mood: signals modality (possibility and necessity)
  – Realis mood
    • The state is known (John is sick)
  – Irrealis mood
    • Indicates that a certain situation or action is not known to have happened as the speaker is talking
    • John may/must be sick


Verbs’ factors

• Aspect
  – Defines the temporal flow (or lack thereof) in the event or state.
  – Habitual aspect
    • I eat, I have eaten, I ate, I had eaten
  – Progressive, or continuous, aspect
    • I am eating, I have been eating, I was eating, I had been eating


Verbs’ factors

• Voice
  – Describes the relationship between the action (or state) that the verb expresses and the participants identified by its arguments (subject, object, etc.).
  – Active voice: when the subject is the agent or actor of the verb (the cat ate the mouse)
  – Passive voice: when the subject is the patient, target or undergoer of the action (the mouse was eaten by the cat)


Other grammatical categories

• Adverbs
• Prepositions
  – In, on, over, at
• Coordinating Conjunctions
  – Link 2 sentences
    • and, or, but...
    • She bought or leased the car
• Subordinating Conjunctions
    • That, because, if...
    • She said that she would lease a car


Phrase structure

• Words are organized in phrases

• Phrases: groupings of words that are clumped as a unit

• Syntax: study of the regularities and constraints of word order and phrase structure


Major phrase types

• Sentence (S) (whole grammatical unit). Normally rewrites as a subject noun phrase and a verb phrase
• Noun phrase (NP): phrase whose head is a noun or a pronoun, optionally accompanied by a set of modifiers
  – Head is the word that determines the syntactic type of the phrase
  – The smart student of physics with long hair
    • The: determiner
    • smart: adjective
    • of physics: complement (prepositional phrase)
    • with long hair: (post)modifier (prepositional phrase)


Major phrase types

• Prepositional phrases (PP)
  – Headed by a preposition and containing an NP
    • She is [on the computer]
    • They walked [to their school]
• Verb phrases (VP)
  – Phrase whose head is a verb
    • [Getting to school on time] was a struggle
    • He [was trying to keep his temper]
    • That woman [quickly showed me the way to hide]


Phrase structure grammar

• Syntactic analysis of sentences
  – (Ultimately) to extract meaning:
    • Mary gave Peter a book
    • Peter gave Mary a book
• Rewrite rules
  – Category → category* (i.e. the symbol on the left side can be rewritten as the sequence of symbols on the right side)
  – Start symbol is S (for sentence)


Phrase structure grammar

Rewrite rules

• S → NP VP
• NP → AT NN
• NP → NP PP
• VP → VP PP
• VP → VP
• PP → IN NP

Lexicon

• AT → the
• NN → child
• NN → cat
• NN → box
• VP → sleep
• VP → eat
• IN → in
• IN → of

Examples

• The cat sleeps
• The cat sleeps in the box
• The cat hopes she can sleeps in the box (NO)
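
The grammar on this slide is small enough to check by program. The sketch below is a plain-Python CKY-style recognizer, not NLTK code and not necessarily what the course used. It makes two assumptions that are ours: the inflected forms "sleeps" and "eats" are entered in the lexicon so that the example sentences match, and the vacuous unary rule rewriting VP as VP is left out because it adds no new strings.

```python
from collections import defaultdict

# Binary rewrite rules as (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "AT", "NN"),
    ("NP", "NP", "PP"),
    ("VP", "VP", "PP"),
    ("PP", "IN", "NP"),
]

# Lexicon: word -> set of categories (inflected verb forms are an assumption).
LEXICON = {
    "the": {"AT"},
    "child": {"NN"}, "cat": {"NN"}, "box": {"NN"},
    "sleeps": {"VP"}, "eats": {"VP"},
    "in": {"IN"}, "of": {"IN"},
}


def recognize(sentence, start="S"):
    """Return True if the grammar can rewrite `start` into the sentence."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    chart = defaultdict(set)   # (i, j) -> categories covering words[i:j]
    for i, word in enumerate(words):
        chart[i, i + 1] = set(LEXICON.get(word, ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point
                for parent, left, right in BINARY_RULES:
                    if left in chart[i, k] and right in chart[k, j]:
                        chart[i, j].add(parent)
    return start in chart[0, n]


for s in ["The cat sleeps",
          "The cat sleeps in the box",
          "The cat hopes she can sleeps in the box"]:
    print(s, "->", recognize(s))
```

The third sentence is rejected simply because "hopes", "she" and "can" are not in the lexicon and no rule embeds one clause inside another, which matches the "NO" on the slide.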


Context free grammars

• The rewrite rules depend solely on the category and not on any surrounding context: Context Free Grammar

• Main problems:
  – Identify these grammars for natural languages (linguistics)
  – Given the grammar, identify the phrase structures of sentences (NLP, parsing)


Phrase structure parsing

• Parsing: the process of reconstructing the derivation(s) or phrase structure trees that give rise to a particular sequence of words

• A parse is a phrase structure tree
  – New art critics write reviews with computers


Phrase structure parsing & ambiguity

• The children ate the cake with a spoon

• PP Attachment Ambiguity

• Why is it important for NLP?
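
One way to make the ambiguity concrete is to count how many phrase structure trees a grammar assigns to the sentence. The sketch below is a plain-Python CKY-style counter; the grammar, the category names (AT, NN, V, IN) and the lexicon are assumptions chosen for this one example, not the grammar used in the lecture.

```python
from collections import defaultdict

# Assumed toy grammar: (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("NP", "AT", "NN"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("PP", "IN", "NP"),
]

# Assumed lexicon: one category per word.
LEXICON = {
    "the": "AT", "a": "AT",
    "children": "NN", "cake": "NN", "spoon": "NN",
    "ate": "V", "with": "IN",
}


def count_parses(sentence, start="S"):
    """Count the distinct trees rooted in `start` that cover the sentence."""
    words = sentence.lower().split()
    n = len(words)
    trees = defaultdict(int)   # (i, j, category) -> number of trees
    for i, word in enumerate(words):
        if word in LEXICON:
            trees[i, i + 1, LEXICON[word]] = 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in RULES:
                    trees[i, j, parent] += trees[i, k, left] * trees[k, j, right]
    return trees[0, n, start]


print(count_parses("The children ate the cake with a spoon"))
print(count_parses("The children ate the cake"))
```

Under these assumptions the sentence with the prepositional phrase gets two trees: one where "with a spoon" modifies "ate" (the instrument reading) and one where it modifies "the cake". The grammar alone cannot choose between them.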


Semantics

• Semantics is the study of the meaning of words, constructions and utterances

1. Study of the meaning of individual words (lexical semantics)

2. Study of how meanings of individual words are combined into the meaning of sentences (or larger units)


Lexical semantics

• How words are related to each other
• Hyponymy
  – scarlet, vermilion, carmine, and crimson are all hyponyms of red
• Hypernymy
• Antonymy (opposite)
  – Male, female
• Meronymy (part of)
  – Tire is a meronym of car
• Etc.
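
These relations can be stored as simple lookup tables. The sketch below uses a tiny hand-made table, not WordNet; the link from "red" up to "color" is an assumption added so that the chain has more than one step.

```python
# Hand-made table: word -> its direct hypernym (the more general term).
HYPERNYM = {
    "scarlet": "red",
    "vermilion": "red",
    "carmine": "red",
    "crimson": "red",
    "red": "color",       # assumption, not from the slide
}

# Meronymy from the slide: part -> whole.
PART_OF = {"tire": "car"}


def hypernym_chain(word):
    """Follow hypernym links upward until no more general term is listed."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """All words whose direct hypernym is `word`, in alphabetical order."""
    return sorted(w for w, parent in HYPERNYM.items() if parent == word)


print(hypernym_chain("scarlet"))
print(hyponyms("red"))
print(PART_OF["tire"])
```

Hyponymy and hypernymy are the same link read in opposite directions, which is why one table serves both functions.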


Semantics: beyond individual words

• Once we have the meaning of the individual words, we need to assemble them to get the meaning of the whole sentence

• Hard because natural language does not obey the principle of compositionality, by which the meaning of the whole can be predicted from the meanings of the parts


Semantics: beyond individual words: complications

• Collocations
  – White skin, white wine, white hair
• Idioms: meaning is opaque
  – Kick the bucket
• Scope
  – Everyone didn't go to the movie
    1. Everyone's scope is over not (i.e. not one person went to the movie)
    2. Negation not has scope over everyone (at least one person didn't go)


Semantics: beyond individual words

• Discourse

• Anaphoric relations
  – Mary helped Peter get out of the car. He thanked her. [He and Peter are the same person, her and Mary too]


Next class

• Syntax of words
• Morphology
• Stemming
  – Collapse related morphological forms to the original lexeme
  – Sit, sits, sitting, sat → lexeme: sit
• Tokenization
  – Divide text into units (words, numbers etc.)
• Word segmentation
  – For languages with no spaces between words