natural language processing in practice
Natural Language Processing in practice
Topics
* Overview of NLP
* Getting Data
* Models & Algorithms
* Building an NLP system
* A practical example
A bit about me
* Lisp programmer
* Architect and research lead at Grammarly (3+ years of NLP work)
* Teacher at KPI: Operating Systems
* Links:
http://lisp-univ-etc.blogspot.com
http://github.com/vseloved
http://twitter.com/vseloved
A bit about Grammarly
(c) xkcd
The best English language writing enhancement app:
Spellcheck - Grammar check - Style improvement - Synonyms and word choice - Plagiarism check
What is NLP?
Transforming free-form text into structured data and back
Intersection of Comp Sci & Linguistics & Software Eng
Based on Algorithms, Machine Learning, and Statistics
Popular NLP problems
* Spam Filtering
* Spelling Correction
* Sentiment Analysis
* Question Answering
* Machine Translation
* Text Summarization
* Search (also IR)
http://www.paulgraham.com/spam.html
http://norvig.com/spell-correct.html
(c) gettyimages
Levels of NLP
* data & tools
* models
* production-ready systems
Role of Linguistics
NLP Data
structured – semi-structured – unstructured
“Data is ten times more powerful than algorithms.”
-- Peter Norvig
The Unreasonable Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs
Kinds of data
* Dictionaries
* Corpora
* User Data
Where to get data?
* Linguistic Data Consortium http://www.ldc.upenn.edu/
* Google ngrams, book ngrams, syntactic ngrams
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites: Stanford, Oxford, CMU, ...
Create your own!
* Linguists
* Crowdsourcing
* By-product
-- Jonathan Zittrain http://goo.gl/hs4qB
Tools
* analysis tools
* processing tools
* Unix command line
* XML processing
* Map-reduce systems
* R, Python, Lisp
(c) O'Reilly Media
Algorithms
* Dynamic Programming
* Search Algorithms
* Tree Algorithms
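As an illustration of dynamic programming in NLP, here is a minimal sketch of edit distance between two strings, a common building block for spelling correction. The function name `edit_distance` and the unit costs for insertion, deletion, and substitution are assumptions made for this example, not something taken from the slides.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    # previous[j] = distance between the source prefix handled so far
    # (initially the empty string) and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # current[0] = distance from source[:i] to the empty string
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, which is enough when just the distance (and not the alignment) is needed.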
Beyond Algorithms
* CKY constituency parsing
* Noisy channel spelling correction
* TF-IDF document classification
* Bayesian filtering
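To make one item from this list concrete, below is a rough sketch of TF-IDF term weighting, the feature step that usually precedes TF-IDF document classification. The function name `tf_idf`, the length-normalized term frequency, and the unsmoothed `log(N / df)` form of IDF are assumptions; many variants of the formula are in use.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # term frequency (relative to document length) times IDF
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["spam", "offer", "offer"], ["meeting", "offer"]]
for weights in tf_idf(docs):
    print(weights)
```

A term that occurs in every document ("offer" here) gets weight zero, while terms specific to one document get positive weights.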
Models
* generative vs discriminative
* statistical vs rule-based
Language Models
Ngrams
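A minimal sketch of an ngram language model, here with n = 2 (bigrams). The class name `BigramModel`, the `<s>` and `</s>` boundary markers, and add-one smoothing are assumptions chosen to keep the example short; production models typically use larger corpora and better smoothing.

```python
import math
from collections import Counter


class BigramModel:
    def __init__(self, sentences):
        self.contexts = Counter()  # how often each token starts a bigram
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence + ["</s>"]
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        # every token the model can predict, including end-of-sentence
        self.vocab = {t for s in sentences for t in s} | {"</s>"}

    def prob(self, prev, word):
        """Add-one smoothed estimate of P(word | prev)."""
        return (self.bigrams[(prev, word)] + 1) / (
            self.contexts[prev] + len(self.vocab)
        )

    def log_prob(self, sentence):
        """Log-probability of a whole tokenized sentence."""
        tokens = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(tokens, tokens[1:]))


model = BigramModel([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(model.log_prob(["the", "cat", "sat"]))
print(model.log_prob(["sat", "cat", "the"]))
```

The sentence in a word order seen during training scores higher than the same words scrambled.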
Generative ML models:
* Bayesian inference (bag-of-words model)
* Hidden Markov model (sequence model)
* Neural networks (holistic model)
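The first item, Bayesian inference over a bag of words, can be sketched as a naive Bayes classifier. The class name `NaiveBayes`, the toy spam/ham training examples, and add-one smoothing are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()

    def train(self, examples):
        for tokens, label in examples:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def score(self, tokens, label):
        """log P(label) + sum of log P(token | label), add-one smoothed."""
        total_docs = sum(self.class_counts.values())
        total_words = sum(self.word_counts[label].values())
        log_p = math.log(self.class_counts[label] / total_docs)
        for token in tokens:
            if token not in self.vocab:
                continue  # ignore words never seen in training
            count = self.word_counts[label][token]
            log_p += math.log((count + 1) / (total_words + len(self.vocab)))
        return log_p

    def classify(self, tokens):
        return max(self.class_counts, key=lambda label: self.score(tokens, label))


nb = NaiveBayes()
nb.train([
    ("buy cheap pills".split(), "spam"),
    ("cheap offer now".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch at noon".split(), "ham"),
])
print(nb.classify("cheap pills now".split()))
```

Word order is ignored entirely, which is what "bag-of-words" means; a sequence model such as an HMM would use it.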
LM + Domain Model
Discriminative Models
* Heuristic
* Maximum Entropy
* “Advanced” LM Models
Going Into Prod
* Translate real-world requirements into a measurable goal
* Pre- and post-processing
* Don't trust research results
* Gather user feedback
Practical Example: Language Detection
Idea
Standard approach: character LM
Let's try an alternative: word LM
Data – from Wiktionary
Test data – from Wikipedia
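A rough sketch of what word-LM language detection could look like: one unigram word model per language, and the language whose model gives the text the highest likelihood wins. This is not the implementation from the talk; the function names, the add-one smoothing, and the tiny inline word lists (stand-ins for Wiktionary-derived data) are all assumptions.

```python
import math
from collections import Counter


def train_word_models(samples):
    """samples: {language: list of words}. Returns per-language word counts."""
    return {lang: Counter(w.lower() for w in words) for lang, words in samples.items()}


def detect_language(text, models):
    words = text.lower().split()
    # shared vocabulary across all languages, plus one slot for unknown words
    vocab_size = len(set().union(*models.values())) + 1

    def log_likelihood(counts):
        total = sum(counts.values())
        return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)

    return max(models, key=lambda lang: log_likelihood(models[lang]))


# Placeholder training data; a real system would load large word lists.
models = train_word_models({
    "en": "the cat is on the table".split(),
    "de": "die katze ist auf dem tisch".split(),
    "fr": "le chat est sur la table".split(),
})
print(detect_language("the table is big", models))
```

Compared with a character LM, a word model needs a much larger vocabulary per language but makes each known word strong evidence on its own.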
Practical ML System
* Training
* Evaluation
* Production
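The training and evaluation stages can be sketched with a held-out test set and an accuracy measure. The helper names `train_test_split` and `accuracy`, the 80/20 split, and the toy parity "model" are assumptions for illustration; any classifier that maps an input to a label could be plugged in as `predict`.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, test_set):
    """Share of (input, label) pairs for which predict(input) == label."""
    correct = sum(predict(x) == y for x, y in test_set)
    return correct / len(test_set)


examples = [(i, i % 2) for i in range(10)]   # toy data: label is the parity
train_set, test_set = train_test_split(examples)
print(len(train_set), len(test_set))
print(accuracy(lambda x: x % 2, test_set))
```

Evaluating only on examples the model never saw during training is what makes the number meaningful before going to production.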
Thanks!
Questions?
Vsevolod Dyomkin
@vseloved