OpenNLP presentation

Upload: chandan-deb

Posted on 17-Jul-2015

Category: Software


TRANSCRIPT

OpenNLP: A Tool for Natural Language Processing

CA-691

Importance of NLP

Preface of OpenNLP

Tasks of NLP

NLP Tasks with OpenNLP

Introduction

Installation of OpenNLP

Applications

Training of OpenNLP

Parallel Technology

Conclusion

References

Huge amount of Data

Classify text into Categories

Index and Search Large Text

Automatic Translation

Speech Understanding

Information Extraction

Automatic Summarization

Question Answering

Natural Language Processing

“Natural Language Processing is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications” (Liddy, 2001)

Natural language: a language spoken by people, e.g. English or Hindi, as opposed to an artificial language such as Java

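[Diagram: NLP within Computer Science. Computer Science covers Databases, AI, Algorithms, …; AI covers Robotics, NLP, Search; NLP covers Information Retrieval, Language Analysis, Translation. Highlighted path: Computer Science -> AI -> NLP -> Language Analysis]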

Text-Based Applications

Dialogue-Based Applications

Speech Recognition (e.g. IBM VoiceType Dictation)

Spoken Language Systems (e.g. Dragon, Operetta)

Language Translation

Information Retrieval

Email Understanding

Natural Language Generation (e.g. CoGenTex)

Question Answering

Summarization (e.g. NetOWL extractor)

NLP Task

Segmentation

Segmentation, also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end
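
The idea of deciding where sentences begin and end can be sketched with a small rule-based splitter. This is only an illustration under simple assumptions (a sentence ends at ".", "!" or "?" followed by whitespace and an uppercase letter); the function name and the abbreviation list are hypothetical, and this is not how OpenNLP's trained sentence detector works internally.

```python
import re

# Illustrative list of abbreviations whose final period does not end a sentence.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr.", "mrs.", "prof."}


def split_sentences(text):
    """Split text at '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith spoke. He left."):
        print(sentence)
```

The abbreviation check shows why plain punctuation rules are not enough, which is the motivation for trainable sentence detectors.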

NLP Task

Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens

Electronic text is a linear sequence of symbols

Before any real text processing, the text needs to be segmented

Input: This is Tokenization. This segments the sentence

Segmented text: This | is | Tokenization. | This | segments | the | sentence

Abbreviations

Hyphenated words

Numerical and special expressions

NLP Task

POS Tagging

POS tagging is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context

POS tagging is also called grammatical tagging or word-category disambiguation

It identifies words as nouns, verbs, adjectives, adverbs, …

Tag   Meaning
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
FW    Foreign word
JJ    Adjective
JJR   Adjective, comparative
NN    Noun
VB    Verb
VBD   Verb, past tense
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
SYM   Symbol
NNP   Proper noun

Natural Language Processing is a field of Computer Science

JJ NN NN VBZ DT NN IN NN NN

NLP Task

Named Entity Extraction

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
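
To make "locate and classify" concrete, here is a minimal rule-based sketch that uses regular expressions for percentages, monetary values and dates, plus a toy name list for persons. The patterns, labels and the name list are all assumptions made for illustration; trained name finders, such as the one in OpenNLP, learn these decisions from annotated data instead.

```python
import re

# Hypothetical patterns for a few of the categories named above.
PATTERNS = [
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("MONEY", re.compile(r"[$€£]\d+(?:,\d{3})*(?:\.\d+)?")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]

# Toy gazetteer (name list); a real system would not rely on a fixed list.
GAZETTEER = {"John": "PERSON", "Mary": "PERSON", "Bill": "PERSON"}


def find_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by position in the text
    return [(entity, label) for _, entity, label in found]


if __name__ == "__main__":
    print(find_entities("John paid $1,200 on 2015-02-24, a 15% increase."))
```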

NLP Task

Chunking

Chunking, also called shallow parsing, identifies parts of speech and groups them into short phrases such as noun phrases
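
A minimal sketch of chunking over already-tagged tokens: it groups an optional determiner, any adjectives and one or more nouns into a noun phrase. The pattern and the function name are assumptions chosen for illustration; OpenNLP's chunker is a trained model, not a hand-written rule.

```python
def np_chunks(tagged):
    """Group runs of (optional DT) (JJ*) (NN+) into noun-phrase chunks.

    `tagged` is a list of (word, tag) pairs using Penn Treebank style tags.
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":  # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):  # adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:  # at least one noun was found, so emit a chunk
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks


if __name__ == "__main__":
    sentence = [
        ("Natural", "JJ"), ("Language", "NN"), ("Processing", "NN"),
        ("is", "VBZ"), ("a", "DT"), ("field", "NN"), ("of", "IN"),
        ("Computer", "NN"), ("Science", "NN"),
    ]
    print(np_chunks(sentence))
```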

NLP Task

Parsing

Parsing is the process of analysing a sentence by taking each word and determining its structure from its constituent parts

E.g. <S> = “John Loves Mary”

<S>
  <NP> (John)
    <N> (John)
  <VP> (Loves Mary)
    <V> (Loves)
    <NP> (Mary)
      <N> (Mary)

NLP Task

Co-reference Resolution

Co-reference occurs when two or more expressions in a text refer to the same person or thing, i.e. they have the same referent

E.g. “Bill said that he would come.”

he -> Bill

OpenNLP is a library for Natural Language Processing

Open source, developed by the Apache Software Foundation

Stable release 1.5.3 in 2013

Java-based and cross-platform

OpenNLP can perform the most common NLP tasks

OpenNLP provides APIs for these tasks

[Diagram: OpenNLP processing pipeline, from input text to end: Segmentation -> Tokenization -> POS Tagging -> NER -> Chunking -> Parsing -> Co-reference resolution]

http://opennlp.apache.org/

http://opennlp.sourceforge.net/models-1.5/

OpenNLP Task

POS Tagging

Tokenization

NER

Chunking

Parsing

Co-Reference

Segmentation

Document Categorization

Tokenization

Whitespace tokenizer: non-whitespace sequences are identified as tokens

Simple tokenizer: a character class tokenizer; sequences of the same character class are tokens

Learnable tokenizer: a maximum entropy tokenizer; detects token boundaries based on a probability model
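
The first two strategies are simple enough to sketch in a few lines. The following is an illustrative approximation, not OpenNLP's code: the function names are hypothetical, and the choice of character classes (letters, digits, other) is an assumption. The learnable tokenizer is omitted because it needs a trained model.

```python
from itertools import groupby


def whitespace_tokenize(text):
    """Every maximal run of non-whitespace characters is one token."""
    return text.split()


def char_class(ch):
    """Assign a character to a coarse class (assumed classes for this sketch)."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def simple_tokenize(text):
    """Every maximal run of characters of the same class is one token."""
    tokens = []
    for cls, group in groupby(text, key=char_class):
        if cls != "space":  # whitespace only separates tokens
            tokens.append("".join(group))
    return tokens


if __name__ == "__main__":
    sample = "About 10 Euro, please."
    print(whitespace_tokenize(sample))
    print(simple_tokenize(sample))
```

On the sample, the whitespace version keeps punctuation attached to words ("Euro,"), while the character class version separates it into its own token.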

The POS tagger expects a tokenized sentence as input, represented as a String array

Each String object in the array is one token

The output is the POS tag associated with each token
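
The input and output contract described here (one array of tokens in, one tag per token out) can be illustrated with a toy lookup tagger. The lexicon and the default tag are assumptions for this sketch only; OpenNLP's tagger uses a trained statistical model rather than a fixed dictionary.

```python
# Toy lexicon mapping lowercase words to tags (illustrative only).
TOY_LEXICON = {"natural": "JJ", "is": "VBZ", "a": "DT", "of": "IN"}


def tag(tokens):
    """Return one tag per token; unknown words default to NN."""
    return [TOY_LEXICON.get(token.lower(), "NN") for token in tokens]


if __name__ == "__main__":
    tokens = "Natural Language Processing is a field of Computer Science".split()
    tags = tag(tokens)
    # The two lists are parallel: tags[i] belongs to tokens[i].
    for token, token_tag in zip(tokens, tags):
        print(token, token_tag)
```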

The Document Categorizer classifies text into predefined categories

It is based on the maximum entropy model

Unlike the other tasks, OpenNLP does not provide a predefined model for document categorization

To use this facility, a model must be built (trained) first

Open a sample data stream

Call SentenceDetectorME.train

Save the SentenceModel

Open a sample data stream

Call TokenizerME.train

Save TokenizerModel

The application must open a sample data stream

Call the POSTaggerME.train method

Save the POSModel

Training data format: About_IN 10_CD Euro_NNP
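
The word_TAG training format shown above is easy to read and write programmatically. The helper names below are hypothetical; the sketch only assumes that tokens are separated by whitespace and that the tag follows the last underscore of each token.

```python
def parse_tagged_sentence(line):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split at the last underscore so words containing '_' still work.
        word, sep, tag = item.rpartition("_")
        if not sep or not word or not tag:
            raise ValueError(f"not in word_TAG form: {item!r}")
        pairs.append((word, tag))
    return pairs


def format_tagged_sentence(pairs):
    """Inverse of parse_tagged_sentence: join pairs back into one line."""
    return " ".join(f"{word}_{tag}" for word, tag in pairs)


if __name__ == "__main__":
    pairs = parse_tagged_sentence("About_IN 10_CD Euro_NNP")
    print(pairs)
    print(format_tagged_sentence(pairs))
```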

The Parser can be trained on annotated training material

The data can be in the OpenNLP format

Training data format:

(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))

(TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))
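
The bracketed format above encodes a tree: each "(" opens a node whose first item is its label, followed by child nodes or a word. A small recursive reader makes that structure explicit. This is an illustrative sketch with hypothetical function names, not the reader used by OpenNLP.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return parse_node()


def leaves(node):
    """Collect the words of a parsed tree from left to right."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


if __name__ == "__main__":
    sample = "(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))"
    tree = parse_tree(sample)
    print(tree[0], leaves(tree))
```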

The Document Categorizer can be trained on annotated training material

The data can be in the OpenNLP Document Categorizer training format

Training data format:

Computer Science is the study of computers and computational systems. Unlike electrical and computer engineers, computer scientists deal mostly with software and software systems; this includes their theory, design, development, and application.
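
In the OpenNLP document categorizer training format, each line holds one document: a category label, whitespace, then the text. The sample on the slide shows only the text part, so the category labels in the sketch below ("CS", "Biology") are hypothetical, as are the function names.

```python
from collections import defaultdict


def parse_doccat_line(line):
    """Split one training line into (category, text)."""
    parts = line.strip().split(None, 1)  # split at the first whitespace run
    if len(parts) != 2:
        raise ValueError("expected '<category> <text>'")
    return parts[0], parts[1]


def group_by_category(lines):
    """Collect the training texts for each category label."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():  # skip empty lines
            category, text = parse_doccat_line(line)
            grouped[category].append(text)
    return dict(grouped)


if __name__ == "__main__":
    training_lines = [
        "CS Computer Science is the study of computers and computational systems.",
        "Biology Biology is the study of living organisms.",
    ]
    print(group_by_category(training_lines))
```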

Distinguo

Open Source Tool

Easy to Install and Use

Multilingual Model Facility (English, Spanish, Thai etc.)

Easy Development of Model

Cross Platform

Document categorization

References:

Avram, S., Caragea, D. and Borangiu, T. (2014). NLP applications in external plagiarism detection. U.P.B. Sci. Bull., Series C, 76(3): 29-36.

Benjamin, C. M. X., Mahmud, R., Qiang, L., Sadanandan, A. A., Onn, K. W. and Lukose, D. (2014). Malay Semantic Text Processing Engine. In Proceedings of the International Conference on Information, Process, and Knowledge Management. pp. 38-43.

Liu, F., Vasardani, M. and Baldwin, T. (2012). Automatic Identification of Locative Expressions from Social Media Text: A Comparative Analysis. International Journal of Computer Applications, 10, 150-156.

http://en.wikipedia.org/wiki/Named-entity_recognition (Accessed 2015-02-24)

http://en.wikipedia.org/wiki/OpenNLP (Accessed 2015-02-15)

http://en.wikipedia.org/wiki/Part-of-speech_tagging (Accessed 2015-02-24)

http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation (Accessed 2015-02-24)

http://en.wikipedia.org/wiki/Shallow_parsing (Accessed 2015-02-24)

http://en.wikipedia.org/wiki/Tokenization_(lexical_analysis) (Accessed 2015-02-18)

http://language.worldofcomputing.net/category/parsing (Accessed 2015-03-06)

http://opennlp.apache.org/cgi-bin/download.cgi (Accessed 2015-02-05)

Liddy, E. D. (2001). Natural Language Processing. In: Encyclopedia of Library and Information Science, 2nd Ed. Marcel Decker, Inc. pp. 362-386.

Michael, H., Jerald, L., Huanying, G. and Paolo, G. (2014). Privacy-Preserving Symptoms-to-Disease Mapping on Smartphones. Mobile and Information Technologies in Medicine, 10, 350-354.