Natural Language Processing (Almost) from Scratch (materials for the 6th Deep Learning study group; Sakaki)

Natural Language Processing (Almost) from Scratch

Ronan Collobert et al., Journal of Machine Learning Research, vol. 12 (2011)

Why This Paper Was Selected

• Introduced in the ACL 2012 tutorial "Deep Learning for NLP"
• Applies deep learning to representative NLP tasks
  – POS tagging
  – Chunking
  – Named Entity Recognition
  – Semantic Role Labeling
• Written by leading researchers in NLP with deep learning
  – Chris Manning
  – Ronan Collobert

Summary of This Paper

Purpose

Propose a unified neural network architecture and learning algorithm that can be applied to various NLP tasks: POS tagging, chunking, NER, and SRL

Conclusion

Instead of hand-crafting features, the system learns internal representations from large amounts of labeled and unlabeled training data

The results provide a foundation for building a freely available tagging system with high accuracy and low computational cost

Summary of This Paper

Points of Note

How data should be handled when applying neural networks to various NLP tasks

How labeled data and unlabeled data are treated differently

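Background and Purpose

Background

Converting natural language into structured data is a fundamental topic in AI research, and a large body of work exists on it

In practice, researchers have improved performance by engineering task-specific features themselves and thereby discovering intermediate representations

Problem

Such improvements are useful in practice, but they yield almost no insight into the larger goals of understanding natural language or building AI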

Background and Purpose

Purpose

Aim to exceed several baseline methods without task-specific engineering

Aim for high accuracy on many NLP tasks by applying intermediate representations discovered from large unlabeled data sets

Build a multi-task language model

Multi-tasking: shared features

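Tasks and Data Sets

Task descriptions

• Part-of-Speech tagging
  – Assign a part of speech to each word or morpheme
• Chunking
  – Extract word sequences that are treated as one grammatical unit, such as noun phrases, verb phrases, and technical terms
• Named Entity Recognition
  – Extract proper nouns (place names, person names, etc.)
• Semantic Role Labeling
  – Assign semantic roles, such as grammatical roles (subject, object, predicate) and dependency relations between words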

benchmark systems

Chapter 3 The Networks

Proposed Method

Problem setting

Every NLP task is treated as assigning labels to words

Traditional Approach

Feed hand-designed features into a classification algorithm

New Approach

Learn with a multilayer neural network

Proposed Method

• Transforming Words into Feature Vectors
• Extracting Higher Level Features from Word Feature Vectors
• Training
• Benchmark Result

Neural Networks

Proposed Method: Overview

Window approach network

Sentence approach network

Building the Lookup Tables

A matrix that represents each word by K discrete features
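The lookup step can be illustrated with a minimal Python sketch. It assumes one small table per discrete feature (here K = 2: a word id and a capitalization flag). The table contents, their sizes, and the name `lookup` are made up for illustration and are not taken from the paper.

```python
# Hypothetical tables: row i holds the feature vector for discrete value i.
word_table = [[0.1, 0.2, 0.3],   # word id 0
              [0.4, 0.5, 0.6],   # word id 1
              [0.7, 0.8, 0.9]]   # word id 2
caps_table = [[0.0],             # caps id 0: no capital letter
              [1.0]]             # caps id 1: has a capital letter


def lookup(feature_ids, tables):
    """Concatenate the rows selected by each of the K discrete feature ids."""
    vector = []
    for table, idx in zip(tables, feature_ids):
        vector.extend(table[idx])
    return vector


# One word described by K = 2 discrete features: (word id, caps id).
x = lookup((1, 1), [word_table, caps_table])
print(x)  # [0.4, 0.5, 0.6, 1.0]
```

In a trained system the table rows would be parameters updated by backpropagation; here they are fixed toy values.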

Extracting Higher Level Features from Word Feature Vectors

A neural network with L layers

Function of layer l

Parameters
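The equations on this slide did not survive extraction. As a hedged reconstruction, a feed-forward network with L layers is usually written as a composition of its layer functions. The symbols $f_\theta^l$ and $\theta$ below are notation assumed here, not recovered from the slide.

```latex
f_\theta(\cdot) = f_\theta^{L}\bigl(f_\theta^{L-1}(\cdots f_\theta^{1}(\cdot)\cdots)\bigr)
```

Here $f_\theta^l$ stands for the function computed by layer $l$, and $\theta$ collects the trainable parameters of all layers.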

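Window approach

For the case $t = 3$, $d_{win} = 2$:

$$x = \left( w_1^1,\ w_1^2,\ w_1^3,\ \dots,\ w_5^{K-1},\ w_5^K \right)^\top$$

(subscript: word position in the sentence; superscript: feature index)

The input vector is the concatenation of the feature vectors of the preceding and following words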
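A minimal sketch of this concatenation, following the slide's convention that `d_win` counts the words on each side of the center word. The padding vector used beyond the sentence boundaries and the function name `window_input` are assumptions for illustration. Positions are 0-based in the code, so the slide's t = 3 corresponds to `t=2`.

```python
def window_input(vectors, t, d_win, padding):
    """Concatenate the feature vectors at positions t-d_win .. t+d_win.

    `t` is 0-based here; positions outside the sentence use `padding`.
    """
    x = []
    for i in range(t - d_win, t + d_win + 1):
        x.extend(vectors[i] if 0 <= i < len(vectors) else padding)
    return x


# Five words, each with a 2-dimensional feature vector (toy values).
vectors = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
padding = [0, 0]

centre = window_input(vectors, t=2, d_win=2, padding=padding)
edge = window_input(vectors, t=0, d_win=2, padding=padding)
print(centre)  # [1, 10, 2, 20, 3, 30, 4, 40, 5, 50]
print(edge)    # [0, 0, 0, 0, 1, 10, 2, 20, 3, 30]
```

The input length is always (2 · d_win + 1) times the per-word feature size, which is what lets a fixed-size linear layer follow.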

Linear Layer (window approach)

Parameters to be trained

$n_{hu}^l$: number of hidden units in layer l

HardTanh Layer (window approach)

• Represents non-linear features
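The two layers above can be sketched in a few lines of plain Python. HardTanh clips each value to the interval [-1, 1]. The weight values and the helper names `linear` and `hard_tanh` are toy choices for illustration only.

```python
def linear(x, W, b):
    """Linear layer: W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def hard_tanh(x):
    """Clip every component to the interval [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x]


# Toy parameters: 2 inputs -> 2 hidden units.
W = [[1.0, -1.0],
     [2.0, 0.5]]
b = [0.5, 0.0]

z = linear([1.0, 2.0], W, b)
h = hard_tanh(z)
print(z)  # [-0.5, 3.0]
print(h)  # [-0.5, 1.0]
```

Compared with the smooth tanh, the clipped version is cheaper to compute while still making the layer non-linear.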

Window Approach

Problems with the Window Approach

It does not work well on the SRL task, because words that stand in a dependency relation can end up in different windows

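Convolutional Layer (sentence approach)

The whole sentence is the input vector → within a single input, the words are fed in one at a time, shifted in time

Time Delay Neural Network

Convolutional Neural Network

Max Layer (sentence approach)

For each hidden unit, the maximum value over all time steps t (from t = 0 to the end of the sentence) is passed on as the input to layer l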
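The convolution followed by a max over time can be sketched as follows: the same linear map is applied to the window around every word, and then each hidden unit keeps its largest value across positions. The weights, the zero padding, and the function names are assumptions for illustration.

```python
def convolve(vectors, W, b, d_win, padding):
    """Apply the same linear map to the window around every position t."""
    outputs = []
    for t in range(len(vectors)):
        x = []
        for i in range(t - d_win, t + d_win + 1):
            x.extend(vectors[i] if 0 <= i < len(vectors) else padding)
        outputs.append([sum(w * v for w, v in zip(row, x)) + b_i
                        for row, b_i in zip(W, b)])
    return outputs


def max_over_time(outputs):
    """For each hidden unit, keep the maximum over all positions."""
    return [max(column) for column in zip(*outputs)]


# Toy sentence of 3 words with 1-dimensional features, 2 hidden units.
vectors = [[1.0], [3.0], [2.0]]
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
b = [0.0, 0.0]

per_position = convolve(vectors, W, b, d_win=1, padding=[0.0])
pooled = max_over_time(per_position)
print(per_position)  # [[0.0, 4.0], [1.0, 5.0], [3.0, 2.0]]
print(pooled)        # [3.0, 5.0]
```

The pooled vector has one entry per hidden unit regardless of sentence length, so sentences of any length can feed the same subsequent layers.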

Tagging Schemes



Training

Maximize the log-likelihood

Word-Level Log-Likelihood

Softmax over all tags
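A minimal sketch of the word-level criterion: the network scores for one word are turned into a log-probability with a softmax over all tags. The score values are toy numbers, and `log_add` is a helper name chosen here for the log-sum-exp operation.

```python
import math


def log_add(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def word_log_likelihood(scores, gold_tag):
    """log p(gold_tag | x) under a softmax over the tag scores."""
    return scores[gold_tag] - log_add(scores)


scores = [2.0, 0.5, -1.0]  # toy network outputs, one per tag
log_probs = [word_log_likelihood(scores, k) for k in range(len(scores))]
total = sum(math.exp(lp) for lp in log_probs)  # should be 1 up to rounding
```

Training maximizes the sum of `word_log_likelihood` over all labeled words, treating each word's tag independently of its neighbors' tags.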

Sentence-Level Log-Likelihood

$A_{k,l}$: transition score to jump from tag k to tag l

Sentence score for a tag path $[i]_1^T$

Conditional likelihood by normalizing w.r.t. all possible paths

The normalization term can be computed with a recursive forward algorithm

Inference: Viterbi algorithm (replace logAdd by max)
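The sentence-level criterion can be sketched as follows, assuming the path score is the sum of transition scores and per-word network scores. The start-score vector `init`, the toy numbers, and all function names are assumptions for illustration. The forward recursion computes the logadd over all tag paths, and the Viterbi routine is the same recursion with logadd replaced by max; a brute-force enumeration is included only as a check.

```python
import itertools
import math


def log_add(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(f, A, init, path):
    """Sentence score of one tag path: start + transition + network scores."""
    s = init[path[0]] + f[0][path[0]]
    for t in range(1, len(path)):
        s += A[path[t - 1]][path[t]] + f[t][path[t]]
    return s


def log_partition(f, A, init):
    """logadd of path_score over all tag paths, via the forward recursion."""
    n = len(init)
    delta = [init[k] + f[0][k] for k in range(n)]
    for t in range(1, len(f)):
        delta = [f[t][k] + log_add([delta[j] + A[j][k] for j in range(n)])
                 for k in range(n)]
    return log_add(delta)


def viterbi(f, A, init):
    """Best tag path: the same recursion with logadd replaced by max."""
    n = len(init)
    delta = [init[k] + f[0][k] for k in range(n)]
    back = []
    for t in range(1, len(f)):
        new_delta, pointers = [], []
        for k in range(n):
            best_j = max(range(n), key=lambda j: delta[j] + A[j][k])
            pointers.append(best_j)
            new_delta.append(delta[best_j] + A[best_j][k] + f[t][k])
        delta = new_delta
        back.append(pointers)
    k = max(range(n), key=lambda j: delta[j])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return path[::-1]


# Toy problem: 3 words, 2 tags. f[t][k] is the network score of tag k at word t.
f = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
A = [[0.5, -1.0], [-0.5, 1.0]]  # A[k][l]: score of moving from tag k to tag l
init = [0.0, 0.0]               # assumed start scores

all_paths = list(itertools.product(range(2), repeat=3))
brute_force = log_add([path_score(f, A, init, p) for p in all_paths])
log_z = log_partition(f, A, init)
best = viterbi(f, A, init)
gold_log_likelihood = path_score(f, A, init, [1, 1, 1]) - log_z
print(best)  # [1, 1, 1]
```

The forward recursion costs time linear in sentence length, whereas the brute-force enumeration grows exponentially with it.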

Preprocessing

• Use lower-case words in the dictionary
• Add a "caps" feature to words that had at least one non-initial capital letter
• Numbers within a word are replaced with the string "NUMBER"
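The three rules above can be sketched as one small function. It implements only what the slide states, with a binary caps flag; the function name `preprocess` and the choice to treat each run of digits as one number are assumptions.

```python
import re


def preprocess(token):
    """Return (dictionary form, caps flag) for one raw token."""
    # Flag taken from the slide: at least one capital letter after the first.
    has_inner_capital = any(ch.isupper() for ch in token[1:])
    # Lower-case first, then replace each run of digits with "NUMBER".
    word = re.sub(r"[0-9]+", "NUMBER", token.lower())
    return word, has_inner_capital


print(preprocess("Paris"))    # ('paris', False)
print(preprocess("PS1"))      # ('psNUMBER', True)
print(preprocess("iPhone4"))  # ('iphoneNUMBER', True)
```

Lower-casing shrinks the dictionary, and the separate caps feature keeps the capitalization information that lower-casing would otherwise discard.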

Hyper-parameters

Benchmark Result

Sentences with similar words should be tagged in the same way:

• The cat sat on the mat
• The feline sat on the mat

Neighboring words

The neighboring words are not semantically related

Chapter 4 Lots of Unlabeled Data

Ranking Language Model
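The formula on this slide did not survive extraction. As a hedged sketch, a pairwise ranking criterion for such a language model scores a real text window higher than a copy whose center word has been replaced by another word, using a hinge loss with margin 1. The function names and the example scores below are assumptions for illustration.

```python
def corrupt(window, replacement):
    """Copy of the window with its centre word replaced."""
    corrupted = list(window)
    corrupted[len(corrupted) // 2] = replacement
    return corrupted


def ranking_loss(score_true, score_corrupted, margin=1.0):
    """Hinge loss: zero once the true window outscores the corrupted one by `margin`."""
    return max(0.0, margin - score_true + score_corrupted)


window = ["the", "cat", "sat", "on", "the"]
negative = corrupt(window, "banana")
print(negative)                 # ['the', 'cat', 'banana', 'on', 'the']
print(ranking_loss(2.0, 0.5))   # 0.0
print(ranking_loss(0.25, 0.5))  # 1.25
```

Because the criterion only compares pairs of scores, it needs no normalization over the whole dictionary, which matters when the dictionary has on the order of 100,000 words.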

Lots of Unlabeled Data

• Two window-approach networks (window size 11, 100 hidden units) trained on two corpora
• LM1
  – Wikipedia: 631M words
  – Order dictionary words by frequency
  – Increase dictionary size: 5,000, 10,000, 30,000, 50,000, 100,000
  – 4 weeks of training
• LM2
  – Wikipedia + Reuters = 631M + 221M = 852M words
  – Initialized with LM1; dictionary size is 130,000
  – 30,000 additional most frequent Reuters words
  – 3 additional weeks of training

Word Embeddings

The neighboring words are semantically related

Benchmark Performance

Chapter 5 Multitask Learning

Multitask Learning

Joint Training

For a given training example, the same pattern is used to obtain different labelings

In the window approach, the parameters of the first layer are shared; in the sentence approach, the convolutional layer is shared
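Parameter sharing in the window approach can be sketched as two task-specific output layers placed on top of one shared first layer. The layer sizes, weight values, task names, and the function `tag_scores` are assumptions for illustration, not the paper's configuration.

```python
def linear(x, W, b):
    """Linear layer: W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def hard_tanh(x):
    """Clip every component to the interval [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x]


# Shared first layer: a single copy of the parameters used by every task.
shared = {"W": [[0.5, -0.5], [1.0, 1.0]], "b": [0.0, 0.0]}

# Task-specific output layers (hypothetical sizes: 3 POS tags, 2 chunk tags).
heads = {
    "pos": {"W": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], "b": [0.0, 0.0, 0.0]},
    "chunk": {"W": [[1.0, -1.0], [-1.0, 1.0]], "b": [0.0, 0.0]},
}


def tag_scores(x, task):
    """Shared hidden representation followed by the task's own output layer."""
    hidden = hard_tanh(linear(x, shared["W"], shared["b"]))
    head = heads[task]
    return linear(hidden, head["W"], head["b"])


x = [1.0, 0.5]
pos_scores = tag_scores(x, "pos")
chunk_scores = tag_scores(x, "chunk")
print(pos_scores)    # [0.25, 1.0, 1.25]
print(chunk_scores)  # [-0.75, 0.75]
```

During joint training, a gradient step on an example from either task would update `shared`, so each task benefits from the other's training data.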


Chapter 6 Temptation

Other Tricks

• Suffix features
  – Use the last two characters as a feature
• Gazetteers
  – 8,000 locations, person names, organizations, and misc entries from CoNLL 2003
• POS
  – Use POS as a feature for CHUNK & NER
• CHUNK
  – Use CHUNK as a feature for SRL

Other Tricks

Ten neural networks were built with different parameters → the accuracy on each task was then examined

Conclusion

• Achievements
  – "All purpose" neural network architecture for NLP tagging
  – Limit task-specific engineering
  – Rely on very large unlabeled datasets
  – We do not plan to stop here
• Criticisms
  – Why forget NLP expertise in favor of neural network training skills?
    • NLP goals are not limited to existing NLP tasks
    • Excessive task-specific engineering is not desirable
  – Why neural networks?
    • Scale on massive datasets
    • Discover hidden representations
    • Most of neural network technology existed in 1997 (Bottou, 1997)
