Natural Language Processing (Almost) from Scratch (6th Deep Learning study group materials; 榊)

Natural Language Processing (Almost) from Scratch. Ronan Collobert et al., Journal of Machine Learning Research, vol. 12 (2011)

DESCRIPTION

Deep Learning Japan @ the University of Tokyo. http://www.facebook.com/DeepLearning https://sites.google.com/site/deeplearning2013/

TRANSCRIPT

Page 1:

Natural Language Processing (Almost) from Scratch

Ronan Collobert et al.
Journal of Machine Learning Research, vol. 12 (2011)

Page 2:

Reasons for selecting this paper
• It is introduced in the ACL 2012 tutorial "Deep Learning for NLP"
• It applies Deep Learning to representative NLP tasks
  – POS tagging
  – Chunking
  – Named Entity Recognition
  – Semantic Role Labeling
• It was written by representative researchers of NLP with Deep Learning
  – Chris Manning
  – Ronan Collobert

Page 3:

Summary of the paper

Objective: Propose a unified neural network architecture and learning algorithm that can be applied to various NLP tasks (POS tagging, Chunking, NER, SRL).

Conclusion: Instead of hand-crafting features, learn internal representations from large amounts of labeled/unlabeled training data. The results of this work form the basis for building a freely available tagging system with high accuracy and low computational cost.

Page 4:

Summary of the paper

Points of interest:

How should data be handled when applying neural networks to the various NLP tasks?

How does the treatment of labeled data differ from that of unlabeled data?

Page 5:

Background and objectives

Background: Converting natural language into structured data is foundational AI research, and a great deal of work has been done on it.

Problem: In practice, researchers have improved performance by engineering task-specific features themselves, thereby discovering intermediate representations. Such improvements are practical, but they yield almost no insight into the larger goals of understanding natural language or building AI.

Page 6:

Background and objectives

Objectives:
• Surpass several baseline systems without task-specific engineering
• Achieve high accuracy on many NLP tasks by applying intermediate representations discovered from large unlabeled data sets
• Build a multi-tasking language model

Page 7:

Multi Tasking: shared features

Page 8:

Tasks and data sets

Page 9:

Task descriptions
• Part-of-Speech tagging
  – Assign a part of speech to each word or morpheme
• Chunking
  – Extract word sequences that act as single grammatical units, such as noun phrases, verb phrases, and technical terms
• Named Entity Recognition
  – Extract proper nouns (place names, person names, etc.)

Page 10:

Task descriptions
• Semantic Role Labeling
  – Assign semantic roles, such as grammatical functions (subject, object, predicate) and dependency relations between words

Page 11:

Benchmark systems

Page 12:

Chapter 3 The Networks

Page 13:

Proposed method: problem setting

All NLP tasks are regarded as assigning labels to words.

Traditional approach: feed hand-designed features to a classification algorithm.

New approach: learning with a multilayer neural network.

Page 14:

Proposed method
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results

Page 15:

Proposed method
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results

Page 16:

Neural Networks

Page 17:

Proposed method: overview

Window approach network / Sentence approach network

Page 18:

Building the lookup tables

A matrix that represents each word with K discrete features.
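As a minimal sketch of this idea (the variable names are hypothetical, not from the paper's code), a lookup table is simply a trainable matrix whose rows are indexed by word ids:

import numpy as np

# Hypothetical sketch: each dictionary word is mapped to a trainable
# K-dimensional feature vector, i.e. one row of the lookup table.
rng = np.random.default_rng(0)
vocab = {"PADDING": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
K = 50                                            # feature vector size
LT = rng.normal(scale=0.1, size=(len(vocab), K))  # lookup table

def lookup(words):
    """Return the stacked feature vectors for a word sequence."""
    return LT[[vocab[w] for w in words]]          # shape (len(words), K)

print(lookup(["the", "cat", "sat"]).shape)        # (3, 50)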

Page 19:

Proposed method
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results

Page 20:

Extracting Higher Level Features From Word Feature Vectors

An L-layer neural network, composed as

$f_\theta(\cdot) = f_\theta^L(f_\theta^{L-1}(\cdots f_\theta^1(\cdot)\cdots))$

where $f_\theta^l$ is the function computed by layer $l$, and the parameters of all layers are trained.

Page 21:

Window approach

For $t = 3$ and $d_{win} = 2$, the window covers words 1 to 5, and the input is the stacked vector of their $K$ features each:

$\left( w_1^1,\ w_1^2,\ \dots,\ w_5^{K-1},\ w_5^K \right)^{\top}$

The input vector is the concatenation of the feature vectors of the words before and after the target word.
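Continuing the sketch above (border handling is simplified here; the paper uses special padding words at the sentence boundaries):

def window_input(words, t, d_win=2):
    """Input vector for position t: the concatenated feature vectors
    of the d_win words on each side of the target word."""
    padded = ["PADDING"] * d_win + words + ["PADDING"] * d_win
    window = padded[t:t + 2 * d_win + 1]          # 2*d_win + 1 words
    return lookup(window).reshape(-1)             # ((2*d_win+1)*K,)

x = window_input(["the", "cat", "sat", "on", "the", "mat"], t=2)
print(x.shape)                                    # (250,) for d_win=2, K=50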

Page 22:

Linear layer (window approach)

$f_\theta^l = W^l f_\theta^{l-1} + b^l$

Parameters to be trained: $W^l$ and $b^l$, where $n_{hu}^l$ is the number of hidden units in layer $l$.

Page 23:

HardTanh layer (window approach)

• Expresses non-linear features
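Together with the linear layer, this computes the following (the HardTanh definition is the paper's):

$f_\theta^{l} = \mathrm{HardTanh}\left(W^{l} f_\theta^{l-1} + b^{l}\right), \qquad \mathrm{HardTanh}(x) = \begin{cases} -1 & x < -1 \\ x & -1 \le x \le 1 \\ 1 & x > 1 \end{cases}$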

Page 24:

Window approach

Problem with the window approach:

It does not work well for the SRL task, because words that are in a dependency relation can end up in different windows.

Page 25:

Convolutional layer (sentence approach)

The entire sentence is the input: within a single input, the words are fed in shifted in time, one position at a time.
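A minimal numpy sketch of this (function names hypothetical): the same linear map is applied to every window position of the sentence, producing one output vector per position:

def conv_layer(sent_feats, W, b, d_win=2):
    """sent_feats: (T, K) feature vectors of a T-word sentence.
    W: (n_hu, (2*d_win+1)*K), b: (n_hu,).
    Applies the shared weights to each window -> (T - 2*d_win, n_hu)."""
    T = sent_feats.shape[0]
    windows = [sent_feats[t:t + 2 * d_win + 1].reshape(-1)
               for t in range(T - 2 * d_win)]
    return np.stack(windows) @ W.T + b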

Page 26:

Time Delay Neural Network

Page 27:

Convolutional Neural Network

Page 28:

Max layer (sentence approach)

For each hidden unit, the maximum value over all time steps $t$ is passed on as the input to layer $l$.
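As a sketch (reusing the conv_layer and lookup above, with illustrative sizes), the max layer reduces the variable-length convolution output to a fixed-size vector:

def max_layer(conv_out):
    """conv_out: (T', n_hu) convolutional features. Per-hidden-unit
    maximum over all positions (max over time) -> (n_hu,)."""
    return conv_out.max(axis=0)

n_hu = 300
W = rng.normal(scale=0.1, size=(n_hu, 5 * K))
b = np.zeros(n_hu)
h = max_layer(conv_layer(lookup(["the", "cat", "sat", "on", "the", "mat"]), W, b))
print(h.shape)   # (300,) regardless of sentence length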

Page 29:

Tagging Schemes

Page 30:

Proposed method
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results

Page 31:

Training

Maximization of the log-likelihood.

Page 32:

Training

Word-level log-likelihood

Softmax over all tags.
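With $f_{\theta,j}$ the network score for tag $j$, this is the paper's word-level log-likelihood:

$\log p(y \mid x, \theta) = f_{\theta,y} - \log \sum_{j} e^{f_{\theta,j}}$

(the second term is the paper's logadd over all tags).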

Page 33:

Training

Sentence-level log-likelihood

$A_{k,i}$: transition score to jump from tag $k$ to tag $i$.

Sentence score for a tag path $[i]_1^T$:

$s([x]_1^T, [i]_1^T, \theta) = \sum_{t=1}^{T} \left( A_{[i]_{t-1},[i]_t} + f_\theta([i]_t, t) \right)$

Page 34:

Training

Sentence-level log-likelihood

Conditional likelihood, obtained by normalizing with respect to all possible tag paths.
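In the paper's notation:

$\log p([y]_1^T \mid [x]_1^T, \theta) = s([x]_1^T, [y]_1^T, \theta) - \operatorname{logadd}_{\forall [j]_1^T} s([x]_1^T, [j]_1^T, \theta)$

where $\operatorname{logadd}_i z_i = \log \sum_i e^{z_i}$.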

Page 35:

Training

The normalization term can be computed with the recursive forward algorithm.

Inference: Viterbi algorithm (replace logAdd by max).
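A minimal sketch of the inference step (array names hypothetical; initial transition scores omitted for simplicity): the same recursion as the forward algorithm, with logAdd replaced by max and backpointers kept for decoding:

def viterbi(scores, A):
    """scores: (T, n_tags) per-position network scores f_theta.
    A: (n_tags, n_tags) transition scores A[k, i] (tag k -> tag i).
    Returns the highest-scoring tag path as a list of tag ids."""
    T, n = scores.shape
    delta = scores[0].copy()        # best score of a path ending in each tag
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + A + scores[t][None, :]   # (from, to)
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):   # follow the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]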

Page 36:

Proposed method
• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Results

Page 37:

Pre-processing

• Use lower-case words in the dictionary
• Add a "caps" feature to words that had at least one non-initial capital letter
• Numbers within a word are replaced with the string "NUMBER"
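A minimal sketch of these steps (the regex choice is an assumption; the paper's exact tokenization rules may differ):

import re

def preprocess(word):
    """Lowercase the word, record a 'caps' feature if it had a
    non-initial capital letter, and map digit runs to 'NUMBER'."""
    caps = any(c.isupper() for c in word[1:])
    return re.sub(r"\d+", "NUMBER", word.lower()), caps

print(preprocess("iPhone42"))   # ('iphoneNUMBER', True)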

Page 38:

Hyper-parameters

Page 39:

Benchmark results

Sentences with similar words should be tagged in the same way:
"The cat sat on the mat" / "The feline sat on the mat"

Page 40:

Neighboring words

The neighboring words are not semantically related.

Page 41:

Chapter 4 Lots of Unlabeled Data

Page 42:

Ranking Language Model
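The pairwise ranking criterion minimized by this language model (from the paper): a window $x$ seen in the corpus should score at least a margin of 1 above the same window $x^{(w)}$ with its center word replaced by any dictionary word $w$:

$\theta \mapsto \sum_{x \in X} \sum_{w \in D} \max\left\{0,\ 1 - f_\theta(x) + f_\theta(x^{(w)})\right\}$

where $X$ is the set of text windows and $D$ the dictionary.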

Page 43:

Lots of Unlabeled Data

• Two window approach networks (window size 11, 100 hidden units) trained on two corpora

• LM1
  – Wikipedia: 631M words
  – order dictionary words by frequency
  – increase dictionary size: 5,000, 10,000, 30,000, 50,000, 100,000
  – 4 weeks of training

• LM2
  – Wikipedia + Reuters = 631M + 221M = 852M words
  – initialized with LM1, dictionary size is 130,000
  – 30,000 additional most frequent Reuters words
  – 3 additional weeks of training

Page 44:

Word Embeddings

The neighboring words are semantically related.

Page 45:

Benchmark Performance

Page 46:

Chapter 5 Multitask Learning

Page 47:

Multitask Learning

Joint Training

For the same training data, different labeling results are obtained using the same internal patterns.

Page 48:

Multitask Learning

In the window approach, the parameters of the first layer are shared; in the sentence approach, the convolutional layer is shared.

Joint Training
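A minimal sketch of this sharing in the window approach (tag counts and sizes are illustrative, reusing the earlier sketches): both tasks go through the same lookup table and first linear layer, and only the output layers are task-specific:

n_hu = 300
shared = {"W1": rng.normal(scale=0.1, size=(n_hu, 5 * K)),
          "b1": np.zeros(n_hu)}                   # plus the lookup table LT
heads = {"POS":   rng.normal(scale=0.1, size=(45, n_hu)),
         "CHUNK": rng.normal(scale=0.1, size=(23, n_hu))}

def forward(words, t, task):
    x = window_input(words, t)                    # shared lookup table
    h = np.clip(shared["W1"] @ x + shared["b1"], -1.0, 1.0)  # HardTanh
    return heads[task] @ h                        # task-specific tag scores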

Page 49:

Multitask Learning

Joint Training

Page 50:

Chapter 6 Temptation

Page 51:

Other refinements
• Suffix features
  – use the last two characters as a feature
• Gazetteers
  – 8,000 locations, person names, organizations, and misc entries from CoNLL 2003
• POS
  – use POS as a feature for CHUNK & NER
• CHUNK
  – use CHUNK as a feature for SRL

Page 52:

Other refinements

Page 53:

Other refinements

Train 10 neural networks with different parameters, then evaluate the accuracy on each task.

Page 54:

Conclusion

• Achievements
  – "All purpose" neural network architecture for NLP tagging
  – Limits task-specific engineering
  – Relies on very large unlabeled datasets
  – We do not plan to stop here

• Critics
  – Why trade NLP expertise for neural network training skills?
    • NLP goals are not limited to existing NLP tasks
    • Excessive task-specific engineering is not desirable
  – Why neural networks?
    • They scale on massive datasets
    • They discover hidden representations
    • Most of the neural network technology already existed in 1997 (Bottou, 1997)