opennlp demo

Post on 26-Jan-2015

Category:

Technology

DESCRIPTION

This PPT was prepared on Ubuntu, so some formatting may be affected when it is opened in Windows.

TRANSCRIPT

Samatha

Gagan Sunil

What is NLP?

• NLP provides a means of analyzing text

• The goal of NLP is to make computers analyze and understand the languages that humans use naturally

• Interaction between computers and humans

Why Natural Language Processing?

• kJfmmfj mmmvvv nnnffn333

• Uj iheale eleee mnster vensi credur

• Baboi oi cestnitze

• Computers “see” text in English the same way you have seen above!

• People have no trouble understanding language

• Computers have

– No common sense knowledge

– No reasoning capacity

Natural Language Processing

raw (unstructured) text → part-of-speech tagging → named entity recognition → deep syntactic parsing → annotated (structured) text

Example sentence: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.

Part-of-speech tags: Secretion/NN of/IN TNF/NN was/VBZ abolished/VBN by/IN BHA/NN in/IN PMA-stimulated/JJ U937/NN cells/NNS ./.

[Figure: syntactic parse tree of the example sentence, grouping the tagged tokens into NP, PP, and VP phrases under a single S node]

Source: personalpages.manchester.ac.uk/staff/Sophia.Ananiadou/DTCII.ppt

Uses of NLP

• Text-based applications

• Dialogue-based applications

• Information extraction: extract useful information, e.g. from resumes

• Automatic summarization: condense one book into one page

What is OpenNLP?

OpenNLP is an open-source, Java-based set of NLP tools [1] that perform:

1. sentence detection
2. tokenization
3. POS tagging
4. parsing
5. named-entity detection

[1] http://opennlp.sourceforge.net/

Use of OpenNLP in our university project

• It can be used to search for names using named entity recognition.

OpenNLP is used for:

• Sentence splitting

• Tokenization

• Part-of-speech tagging

• Named entity recognition

• Chunking

• Treebank Parser

Sentence splitting

sentence boundary = period + space(s) + capital letter

Input:

Unusually, the gender of crocodiles is determined by temperature. If the eggs are incubated at over 33c, then the egg hatches into a male or 'bull' crocodile. At lower temperatures only female or 'cow' crocodiles develop.

Output:

Unusually, the gender of crocodiles is determined by temperature.

If the eggs are incubated at over 33c, then the egg hatches into a male or 'bull' crocodile.

At lower temperatures only female or 'cow' crocodiles develop.
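The boundary rule "period + space(s) + capital letter" can be sketched with a regular expression. This is only an illustration of the rule as stated on the slide, not how OpenNLP works internally (its sentence detector uses a trained model); the function name `split_sentences` is hypothetical.

```python
import re

def split_sentences(text):
    """Split at whitespace that follows a period and precedes a capital letter."""
    # (?<=\.)  the previous character is a period
    # \s+      one or more spaces (consumed by the split)
    # (?=[A-Z]) the next character is a capital letter
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text.strip())

text = (
    "Unusually, the gender of crocodiles is determined by temperature. "
    "If the eggs are incubated at over 33c, then the egg hatches into a male "
    "or 'bull' crocodile. At lower temperatures only female or 'cow' "
    "crocodiles develop."
)

for sentence in split_sentences(text):
    print(sentence)
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason a trained model is normally preferred.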

sentDetect(s, language = "en", model = NULL)

s - A character vector with texts from which sentences should be detected.

language - A character string giving the language of s. This argument is only used if model is NULL, for selecting a default model.

model - A model. If model is NULL, then a default model for sentence detection is loaded from the corresponding openNLP models language package.

http://opennlp.sourceforge.net/

Tokenization

• Convert a sentence into a sequence of tokens

• Divides the text into its smallest units (usually words), splitting punctuation off as separate tokens.

Rule:

• Use spaces as the boundaries

• Add spaces before and after special characters

tokenize(s, language = "en", model = NULL)

http://opennlp.sourceforge.net/
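The two rules above can be sketched in a few lines of Python. This is a rule-based illustration only, not OpenNLP's tokenizer; the function name `tokenize_simple` is hypothetical, and the extra step that splits "n't" from its verb (the Penn Treebank convention) is an assumption added for the example.

```python
import re

def tokenize_simple(sentence):
    """Tokenize by padding special characters with spaces, then splitting on spaces."""
    # Assumption: split the "n't" clitic from its verb, Penn Treebank style.
    padded = re.sub(r"n't\b", " n't", sentence)
    # Rule: add spaces before and after special characters
    # (anything that is not a word character, whitespace, or an apostrophe).
    padded = re.sub(r"([^\w\s'])", r" \1 ", padded)
    # Rule: use spaces as the boundaries.
    return padded.split()

tokens = tokenize_simple(
    "A Saudi Arabian woman can get a divorce if her husband doesn't give her coffee."
)
print(" ".join(tokens))
```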

Tokenization

"A Saudi Arabian woman can get a divorce if her husband doesn't give her coffee."

" A Saudi Arabian woman can get a divorce if her husband does n't give her coffee . "

Part-of-speech tagging

Assign a part-of-speech tag to each token in a sentence.

Most/JJS lipstick/NN is/VBZ partially/RB made/VBN of/IN fish/NN scales/NNS

Most lipstick is partially made of fish scales

tagPOS(sentence, language = "en", model = NULL, tagdict = NULL)

http://opennlp.sourceforge.net/
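Tagged output in the token/TAG form shown above is easy to take apart again. The sketch below assumes that format exactly (one slash-separated pair per whitespace-delimited item); the helper name `parse_tagged` is hypothetical.

```python
def parse_tagged(text):
    """Turn 'token/TAG token/TAG ...' into a list of (token, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits on the LAST slash, so tokens that themselves
        # contain a slash still keep their tag intact.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs

tagged = "Most/JJS lipstick/NN is/VBZ partially/RB made/VBN of/IN fish/NN scales/NNS"
pairs = parse_tagged(tagged)

# Example use: keep only the nouns (tags starting with NN).
nouns = [token for token, tag in pairs if tag.startswith("NN")]
print(nouns)
```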

Part-of-speech tags [1]

CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
PRP - Personal pronoun
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WRB - Wh-adverb

Phrase labels:

NP - Noun phrase
PP - Prepositional phrase
VP - Verb phrase

[1] http://bulba.sdsu.edu/jeanette/thesis/PennTags.html

Named-Entity Recognition

• Named entity recognition classifies tokens in text into predefined categories such as date, location, person, and time.

• The name finder can find up to seven different types of entities - date, location, money, organization, percentage, person, and time.

Named-Entity Recognition

Diana Hayden was in Philadelphia city on 3rd october

<namefind/person>Diana Hayden</namefind/person> was in <namefind/location>Philadelphia</namefind/location> city on <namefind/date>3rd october</namefind/date>
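To use the recognized names (for example, in a search feature), the entities have to be pulled back out of the annotated text. The sketch below assumes the `<namefind/type>...</namefind/type>` markup exactly as it appears in the example above; the function name `extract_entities` is hypothetical.

```python
import re

# Matches <namefind/TYPE>text</namefind/TYPE>; \1 forces the closing
# tag to carry the same TYPE as the opening tag.
ENTITY_PATTERN = re.compile(r"<namefind/(\w+)>(.*?)</namefind/\1>")

def extract_entities(tagged_text):
    """Return a list of (entity_type, entity_text) pairs."""
    return [(m.group(1), m.group(2)) for m in ENTITY_PATTERN.finditer(tagged_text)]

tagged = (
    "<namefind/person>Diana Hayden</namefind/person> was in "
    "<namefind/location>Philadelphia</namefind/location> city on "
    "<namefind/date>3rd october</namefind/date>"
)

for entity_type, entity_text in extract_entities(tagged):
    print(entity_type, "->", entity_text)
```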

Chunking (shallow parsing)

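He reckons the current account deficit will narrow to only # 1.8 billion in September .

[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] .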

A chunker (shallow parser) segments a sentence into meaningful phrases.

Source: personalpages.manchester.ac.uk/staff/Sophia.Ananiadou/DTCII.ppt
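The idea of chunking can be sketched as grouping neighbouring tokens whose part-of-speech tags point to the same phrase type. Everything below is a toy under stated assumptions: the POS tags for the example sentence and the `TAG_TO_CHUNK` table are assumed for illustration (they are not on the slide), and a real chunker such as OpenNLP's is trained from data rather than driven by a lookup table.

```python
from itertools import groupby

# Hypothetical, deliberately coarse mapping from POS tag to chunk type.
# (Mapping RB to NP only works for this sentence; adverbs are not
# generally part of noun phrases.)
TAG_TO_CHUNK = {
    "PRP": "NP", "DT": "NP", "JJ": "NP", "NN": "NP", "NNP": "NP",
    "CD": "NP", "#": "NP", "RB": "NP",
    "VBZ": "VP", "VB": "VP", "MD": "VP",
    "IN": "PP", "TO": "PP",
}

def chunk(tagged):
    """Group consecutive (token, tag) pairs that map to the same chunk type."""
    chunks = []
    by_type = groupby(tagged, key=lambda pair: TAG_TO_CHUNK.get(pair[1], "O"))
    for label, group in by_type:
        chunks.append((label, " ".join(token for token, _ in group)))
    return chunks

# Assumed POS tags for the example sentence.
sentence = [
    ("He", "PRP"), ("reckons", "VBZ"), ("the", "DT"), ("current", "JJ"),
    ("account", "NN"), ("deficit", "NN"), ("will", "MD"), ("narrow", "VB"),
    ("to", "TO"), ("only", "RB"), ("#", "#"), ("1.8", "CD"),
    ("billion", "CD"), ("in", "IN"), ("September", "NNP"), (".", "."),
]

for label, words in chunk(sentence):
    print(label, words)
```

One limitation of this grouping: two adjacent phrases of the same type would be merged into one chunk, which a trained chunker avoids by predicting explicit begin/inside labels.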

Treebank parser

It tags tokens and groups phrases into a tree.

(TOP (S (NP (DT A) (NN hospital) (NN bed)) (VP (VBZ is) (NP (NP (DT a) (VBN parked) (NN taxi)) (PP (IN with) (NP (DT the) (NN meter) (VBG running)))))))

A hospital bed is a parked taxi with the meter running
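The bracketed output above is a nested structure, so it can be read back into a tree with a small recursive parser. This is a minimal sketch that assumes well-formed input in exactly this "(LABEL child child ...)" notation; the names `parse_tree`, `leaves`, and `render` are hypothetical helpers, not part of OpenNLP.

```python
import re

PARSE = ("(TOP (S (NP (DT A) (NN hospital) (NN bed)) (VP (VBZ is) "
         "(NP (NP (DT a) (VBN parked) (NN taxi)) (PP (IN with) "
         "(NP (DT the) (NN meter) (VBG running)))))))")

def parse_tree(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())   # nested phrase or tagged word
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    return node()

def leaves(tree):
    """Collect the words of the sentence from left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def render(tree, depth=0):
    """Return the tree as indented lines, one label or word per line."""
    lines = ["  " * depth + tree[0]]
    for child in tree[1]:
        if isinstance(child, tuple):
            lines.extend(render(child, depth + 1))
        else:
            lines.append("  " * (depth + 1) + child)
    return lines

tree = parse_tree(PARSE)
print("\n".join(render(tree)))
```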

Visualization of Treebank Parser

[Figure: parse tree for "a hospital bed is a parked taxi with the meter running". S branches into NP (DT NN NN) and VP (VBZ NP); the object NP contains NP (DT VBN NN) and PP (IN NP), whose NP is DT NN VBG.]
