OpenNLP Presentation
Post on 17-Jul-2015
Importance of NLP
Preface of OpenNLP
Tasks of NLP
NLP tasks in OpenNLP
Introduction
Installation of OpenNLP
Huge amount of Data
Classify text into Categories
Index and Search Large Text
Automatic Translation
Speech Understanding
Information Extraction
Automatic Summarization
Question Answering
Natural Language Processing
“Natural Language Processing is a theoretically
motivated range of computational techniques for
analyzing and representing naturally occurring texts
at one or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or applications”
(Liddy, 2001)
Natural language: a language spoken by people, e.g. English or Hindi,
as opposed to an artificial language such as Java
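[Figure: hierarchy placing NLP within Computer Science. Level 1: Computer Science. Level 2: Database, AI, Algorithms, … Level 3: Robotics, NLP, Search. Level 4: Information Retrieval, Language Analysis, Translation]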
Text-Based Applications
Dialogue-Based Applications
Speech Recognition (e.g. IBM VoiceType Dictation)
Spoken Language Systems (e.g. Dragon, Operetta)
Language Translation
Information Retrieval
Email Understanding
Natural Language Generation (e.g. CoGenTex)
Question Answering
Summarization (e.g. NetOWL extractor)
NLP Task
Segmentation
Segmentation, also known as sentence breaking, is the problem
in natural language processing of deciding where sentences
begin and end
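As a rough illustration of sentence breaking, the sketch below uses the JDK's rule-based `java.text.BreakIterator` rather than OpenNLP's own sentence detector, so its behaviour on abbreviations and other hard cases may differ. The class and method names are hypothetical and chosen only for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSegmentationSketch {

    // Splits text into sentences with the JDK's rule-based BreakIterator.
    // This is NOT OpenNLP's sentence detector; it only illustrates the task.
    static List<String> splitSentences(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "OpenNLP is a Java library. It supports tokenization. Does it parse text?";
        // Prints one detected sentence per line.
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```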
Tokenization
Tokenization is the process of breaking a stream of text up into
words, phrases, symbols, or other meaningful elements called
tokens
Electronic text is a linear sequence of symbols
Before any real text processing, the text needs to be segmented
[Figure: the text "This is Tokenization. This segments the sentence" split into individual tokens, labelled "Segmented Text"]
Abbreviations
Hyphenated words
Numerical and special expressions
POS Tagging
POS tagging is the process of marking up a word in a text as
corresponding to a particular part of speech, based on both
its definition and its context
POS tagging is also called grammatical tagging or word-category disambiguation
Identification of words as nouns, verbs, adjectives, adverbs…
Tag   Meaning
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
FW    Foreign word
JJ    Adjective
JJR   Adjective, comparative
NN    Noun
VB    Verb
VBD   Verb, past tense
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
SYM   Symbol
NNP   Proper noun
Named Entity Extraction
Named-entity recognition (NER) is a subtask of information
extraction that seeks to locate and classify elements in text into
pre-defined categories such as the names of persons,
organizations, locations, expressions of times, quantities,
monetary values, percentages, etc.
Chunking
Chunking, also called shallow parsing, is the identification of
parts of speech and short phrases in a sentence
Parsing
Parsing is the process of analysing a sentence by taking each word
and determining its structure from its constituent parts
E.g. <S> = "John Loves Mary"
<S>
    <NP> (John)
        <N> (John)
            John
    <VP> (Loves Mary)
        <V> (Loves)
            Loves
        <NP> (Mary)
            <N> (Mary)
                Mary
Co-reference Resolution
Co-reference occurs when two or more expressions in a text
refer to the same person or thing; they have the same referent
OpenNLP is a library for Natural Language Processing
Open source, developed by the Apache Software Foundation
Stable Release 1.5.3 in 2013
Java based and cross platform
OpenNLP can perform the common NLP tasks
OpenNLP provides APIs for these NLP tasks
[Figure: OpenNLP processing pipeline from "Text" to "End" with the stages Segmentation, Tokenization, POS Tagging, NER, Chunking, Parsing, Co-reference resolution]
OpenNLP Tasks
POS Tagging
Tokenization
NER
Chunking
Parsing
Co-Reference
Segmentation
Document Categorization
Tokenization
Whitespace tokenizer: non-whitespace sequences are identified as tokens
Simple tokenizer: a character class tokenizer; sequences of the same character class are tokens
Learnable tokenizer: a maximum entropy tokenizer; detects token boundaries based on a probability model
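The first two tokenizer types can be sketched with plain JDK code. The example below is only an approximation under stated assumptions: it does not call OpenNLP, the class and method names are hypothetical, and the four character classes (whitespace, letter, digit, other) are a simplification that may not match OpenNLP's exact rules. The learnable tokenizer needs a trained model and is not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizerSketch {

    // Whitespace tokenizer: every maximal run of non-whitespace characters is a token.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Coarse character classes for this sketch (an assumption, not OpenNLP's exact classes).
    private static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3; // punctuation and other symbols
    }

    // Character class tokenizer: consecutive characters of the same class form one token.
    static List<String> characterClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int cls = charClass(text.charAt(start));
            int end = start + 1;
            while (end < text.length() && charClass(text.charAt(end)) == cls) {
                end++;
            }
            if (cls != 0) { // whitespace runs separate tokens but are not tokens themselves
                tokens.add(text.substring(start, end));
            }
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "He paid 10 Euro.";
        System.out.println(whitespaceTokenize(text));     // [He, paid, 10, Euro.]
        System.out.println(characterClassTokenize(text)); // [He, paid, 10, Euro, .]
    }
}
```

The difference shows on the last word: the whitespace version keeps "Euro." as one token, while the character class version separates the full stop.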
The POS tagger expects a tokenized sentence as input, which is represented as a String array
Each String object in the array is one token
It returns the POS tag associated with each token
The Document Categorizer classifies text into predefined categories
Based on the Maximum Entropy Model
Unlike the other tasks, OpenNLP does not provide a predefined model for
document categorization
To use this facility, a model must be built first
To train the POS tagger, the application must open a sample data stream
Call the POSTaggerME.train method
Training Data Format: About_IN 10_CD Euro_NNP
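To make the word_TAG training format concrete, here is a minimal sketch that reads one such line into word/tag pairs. It does not use OpenNLP's own sample readers; the class `PosTrainingLineSketch` and its `TaggedToken` type are hypothetical names for this illustration, and splitting at the last underscore is an assumption about how words containing "_" should be handled.

```java
import java.util.ArrayList;
import java.util.List;

final class PosTrainingLineSketch {

    // One token together with its part-of-speech tag.
    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return word + "/" + tag;
        }
    }

    // Parses one training line such as "About_IN 10_CD Euro_NNP".
    // The tag is taken after the LAST underscore, so a word may itself contain '_'.
    static List<TaggedToken> parseLine(String line) {
        List<TaggedToken> result = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            int split = item.lastIndexOf('_');
            if (split <= 0 || split == item.length() - 1) {
                throw new IllegalArgumentException("Expected word_TAG but found: " + item);
            }
            result.add(new TaggedToken(item.substring(0, split), item.substring(split + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("About_IN 10_CD Euro_NNP")); // [About/IN, 10/CD, Euro/NNP]
    }
}
```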
The Parser can be trained on annotated training material
The data can be in the OpenNLP format
Training Data Format:
(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))
(TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))
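In this bracketed format, every innermost pair such as (DT Some) holds a POS tag followed by a word. The sketch below, which uses only a JDK regular expression and not OpenNLP's parser classes, pulls those pairs out of one of the sample trees and prints them in the word_TAG form used for POS tagger training. The class name and the regular expression are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParseTreeLeavesSketch {

    // Matches an innermost "(TAG word)" pair: neither part contains spaces or parentheses.
    private static final Pattern LEAF = Pattern.compile("\\(([^\\s()]+) ([^\\s()]+)\\)");

    // Returns the leaves of a bracketed parse as word_TAG items, left to right.
    static List<String> leaves(String bracketedParse) {
        List<String> items = new ArrayList<>();
        Matcher matcher = LEAF.matcher(bracketedParse);
        while (matcher.find()) {
            items.add(matcher.group(2) + "_" + matcher.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String parse = "(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))";
        System.out.println(String.join(" ", leaves(parse))); // Some_DT say_VBP November_NNP ._.
    }
}
```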
The Document Categorizer can be trained on annotated training material
The data can be in the OpenNLP Document Categorizer training format
Training Data Format:
Computer Science is the study of computers and computational systems. Unlike electrical and computer engineers, computer scientists deal mostly with software and software systems; this includes their theory, design, development, and application.
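According to the OpenNLP documentation, the Document Categorizer training format holds one document per line, with the category name first and the document text after it, separated by whitespace. The sketch below splits such a line with plain JDK code. It does not use OpenNLP's own stream classes, and the category label "ComputerScience" is a hypothetical example, not taken from the slides.

```java
final class DoccatLineSketch {

    // Splits one training line into { category, document text } at the first whitespace run.
    static String[] parseLine(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected: <category> <document text>");
        }
        return parts;
    }

    public static void main(String[] args) {
        // "ComputerScience" is a hypothetical category label chosen for this sketch.
        String line = "ComputerScience Computer science is the study of computers and computational systems.";
        String[] parsed = parseLine(line);
        System.out.println("category = " + parsed[0]);
        System.out.println("text     = " + parsed[1]);
    }
}
```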
Open Source Tool
Easy to Install and Use
Multilingual Model Facility (English, Spanish, Thai, etc.)
Easy Development of Model
Cross Platform
Document categorization
References:
Avram, S., Caragea, D. and Borangiu, T. (2014). NLP applications in external plagiarism detection. U.P.B. Sci. Bull., Series C, 76(3): 29-36.
Benjamin, C. M. X., Mahmud, R., Qiang, L., Sadanandan, A. A., Onn, K. W. and Lukose, D. (2014). Malay Semantic Text Processing Engine. In: Proceedings of the International Conference on Information, Process, and Knowledge Management. pp. 38-43.
Liddy, E. D. (2001). Natural Language Processing. In: Encyclopedia of Library and Information Science, 2nd Ed. Marcel Decker, Inc. pp. 362-386.
Liu, F., Vasardani, M. and Baldwin, T. (2012). Automatic Identification of Locative Expressions from Social Media Text: A Comparative Analysis. International Journal of Computer Applications, 10, 150-156.
Michael, H., Jerald, L., Huanying, G. and Paolo, G. (2014). Privacy-Preserving Symptoms-to-Disease Mapping on Smartphones. Mobile and Information Technologies in Medicine, 10, 350-354.
http://en.wikipedia.org/wiki/Named-entity_recognition (Accessed 2015-02-24)
http://en.wikipedia.org/wiki/OpenNLP (Accessed 2015-02-15)
http://en.wikipedia.org/wiki/Part-of-speech_tagging (Accessed 2015-02-24)
http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation (Accessed 2015-02-24)
http://en.wikipedia.org/wiki/Shallow_parsing (Accessed 2015-02-24)
http://en.wikipedia.org/wiki/Tokenization_(lexical_analysis) (Accessed 2015-02-18)
http://language.worldofcomputing.net/category/parsing (Accessed 2015-03-06)
http://opennlp.apache.org/cgi-bin/download.cgi (Accessed 2015-02-05)