multimedia-text team report_2015-07-31

MM text team: 蔡捷恩, 莊文立, 溫鈺瑋, 2015 @ Delta Research Center

Post on 14-Aug-2015

TRANSCRIPT

Page 1: Multimedia-text team report_2015-07-31

MM text team: 蔡捷恩, 莊文立, 溫鈺瑋

2015@Delta Research Center

Page 2: Multimedia-text team report_2015-07-31

Fully automatic Function/Technology (F/T) matrix analysis from patent data

蔡捷恩

Page 3: Multimedia-text team report_2015-07-31

Function/Technology Matrix

Using keyword “autologous cell renewal therapy technology”

“The Patent-Classification Technology/Function Matrix - A Systematic Method for Design Around”, Cheng et al. Mar-2013, CSIR

Page 4: Multimedia-text team report_2015-07-31

Problem reduction

• Reduce the task to detecting problem/solution pairs in a patent document

“Automatic Discovery of Technology Trends from Patent”, Y. Kim et al. 2009, ACMSAC

Page 5: Multimedia-text team report_2015-07-31

Problem term detection

• Step 1. Finding key frames
• Step 2. Feature extraction
– Unsupervised feature
– Supervised feature
• Step 3. Classifier training

Page 6: Multimedia-text team report_2015-07-31

Step 1. Key frame detection

• We define key frames to be “atomic noun phrases & their verb phrase expansion”
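The definition above can be sketched as a small chunker over POS-tagged tokens. This is only an illustration: the Penn Treebank tag set, the hand-assigned tags, and the rule "a verb directly before the noun phrase forms its verb phrase expansion" are assumptions, not details taken from the paper or the slides.

```python
# Sketch: extract key frames (atomic noun phrases and their verb phrase
# expansions) from tokens that already carry POS tags.
# Assumption: Penn Treebank tags; the grouping rules below are illustrative.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def atomic_noun_phrases(tagged):
    """Return (start, end) spans of maximal DT/JJ/NN* runs that end in a noun."""
    spans, i, n = [], 0, len(tagged)
    while i < n:
        if tagged[i][1] not in NP_TAGS:
            i += 1
            continue
        j = i
        while j < n and tagged[j][1] in NP_TAGS:
            j += 1
        end = j
        # Drop trailing determiners/adjectives so the phrase ends in a noun.
        while end > i and tagged[end - 1][1] not in NOUN_TAGS:
            end -= 1
        if end > i:
            spans.append((i, end))
        i = j
    return spans


def key_frames(tagged):
    """Return each atomic NP, plus 'verb + NP' when a verb directly precedes it."""
    frames = []
    for start, end in atomic_noun_phrases(tagged):
        np_text = " ".join(word for word, _ in tagged[start:end])
        frames.append(np_text)
        if start > 0 and tagged[start - 1][1].startswith("VB"):
            frames.append(tagged[start - 1][0] + " " + np_text)
    return frames


# Hand-tagged example sentence (tags assigned manually for illustration).
sentence = [("the", "DT"), ("method", "NN"), ("reduces", "VBZ"),
            ("the", "DT"), ("power", "NN"), ("consumption", "NN")]
frames = key_frames(sentence)
```

Here `frames` holds the two noun phrases and the expanded frame "reduces the power consumption", which is the kind of candidate a problem/solution classifier would then score.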

Page 7: Multimedia-text team report_2015-07-31

Step 2 – Unsupervised feature (language model)

• The model:

Maximum likelihood estimation (MLE)
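The model formula did not survive in this transcript, so the following is only a generic sketch of what an MLE language-model feature could look like: unigram probabilities estimated as relative frequencies, and a key frame scored by its log-likelihood. The unigram choice, the `floor` value for unseen tokens, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def train_unigram_mle(documents):
    """MLE unigram model: P(w) = count(w) / total number of tokens."""
    counts = Counter(tok for doc in documents for tok in doc)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def log_likelihood(frame_tokens, model, floor=1e-6):
    """Log-likelihood of a key frame; unseen tokens get a small floor
    probability (a crude stand-in for proper smoothing)."""
    return sum(math.log(model.get(tok, floor)) for tok in frame_tokens)


# Toy, already-tokenized documents (hypothetical data).
docs = [["reduce", "noise"], ["reduce", "cost"]]
model = train_unigram_mle(docs)
score = log_likelihood(["reduce", "noise"], model)
```

The resulting score can be used as one real-valued feature per key frame, without needing any labeled data.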

Page 8: Multimedia-text team report_2015-07-31

Step 2 – Supervised feature (linguistic model)

• Based on part-of-speech (POS) statistics from labeled patents

Page 9: Multimedia-text team report_2015-07-31

Step 2 – Supervised feature (linguistic model)

• The model:

Delta function = 1 only when the current key frame matches the given pattern
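The delta function described above is an indicator feature. A minimal sketch, assuming patterns are POS-tag sequences and that a "match" means exact equality; the three patterns listed are hypothetical, since the patterns used in the paper are not shown on the slide.

```python
# Hypothetical POS patterns; the real ones would come from statistics
# over labeled patents.
PATTERNS = [("VB", "NN"), ("VBZ", "DT", "NN"), ("JJ", "NN")]


def delta(frame_tags, pattern):
    """Return 1 only when the key frame's POS sequence equals the pattern."""
    return 1 if tuple(frame_tags) == tuple(pattern) else 0


def linguistic_features(frame_tags, patterns=PATTERNS):
    """One 0/1 feature per pattern, in pattern order."""
    return [delta(frame_tags, p) for p in patterns]


features = linguistic_features(["JJ", "NN"])
```

Each key frame then gets a binary vector with at most one active position per matching pattern.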

Page 10: Multimedia-text team report_2015-07-31

Step 3. Classifier training

• Simply concatenate the features mentioned above and feed them to LIBSVM
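As an illustration of the concatenation step, the sketch below joins feature groups into one vector and writes it in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with 1-based indices). The helper name and the sample values are hypothetical; training itself would be done with the LIBSVM tools on the resulting file.

```python
def to_libsvm_line(label, *feature_groups):
    """Concatenate feature groups and format one example as a LIBSVM line.

    Zero-valued features are omitted, as the sparse format allows.
    """
    features = [value for group in feature_groups for value in group]
    pairs = [f"{index}:{value:g}"
             for index, value in enumerate(features, start=1)
             if value != 0]
    return " ".join([str(label)] + pairs)


# One key frame: an unsupervised score followed by three indicator features.
line = to_libsvm_line(1, [-2.08], [0, 0, 1])
```

For this example `line` is `1 1:-2.08 4:1`: the language-model score occupies index 1 and the only active indicator lands at index 4.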

Page 11: Multimedia-text team report_2015-07-31

Solution term detection

• Step 1. Key frame detection
• Step 2. Feature extraction
– Unsupervised feature
– Supervised feature: based on problem terms
• Step 3. Classifier training

Page 12: Multimedia-text team report_2015-07-31

Problems

• Lack of labeled data => the linguistic model proposed in the paper seems general enough => adopt it directly, with Porter stemming

Page 13: Multimedia-text team report_2015-07-31

Further improvement

• Coreference resolution
– “the method solves the problem of overfitting.”

• Semantic-based clustering
– Okapi BM25: “The Probabilistic Relevance Framework: BM25 and Beyond”, Robertson et al., 2009
– Word vector: “Efficient Estimation of Word Representations in Vector Space”, T. Mikolov, ICLR, 2013
– Document vector: “Distributed Representations of Words and Phrases and their Compositionality”, NIPS, 2013

In my opinion: Okapi BM25 > word vector > document vector
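For reference, a compact sketch of Okapi BM25 scoring, which could serve as the similarity measure for clustering extracted terms or sentences. The parameter defaults (`k1=1.5`, `b=0.75`), the non-negative IDF variant, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # IDF variant with +1 inside the log so it never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical problem/solution phrases).
docs = [["reduce", "noise", "filter"],
        ["increase", "battery", "life"],
        ["noise", "filter", "circuit"]]
scores = bm25_scores(["noise", "filter"], docs)
```

The first and third documents share both query terms and have the same length, so they receive equal scores, while the second scores zero.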

Page 14: Multimedia-text team report_2015-07-31

Thank you

Page 15: Multimedia-text team report_2015-07-31

Chinese domain term extraction

溫鈺瑋

Page 16: Multimedia-text team report_2015-07-31

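Examples

× Incorrect segmentation: 目前 此車 铣 設備 由 绮 發 機械 提供
Correct segmentation: 目前 此 車铣 設備 由 绮發機械 提供
(“This turn-mill equipment is currently provided by 绮發機械.”)

× Incorrect segmentation: L 固定 板會 有 擺動 過大 疑慮
Correct segmentation: L 固定板 會 有 擺動 過大 疑慮
(“There is a concern that the L fixing plate swings too much.”)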

Page 17: Multimedia-text team report_2015-07-31

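Method
• Collocation
– Use mutual information (MI) to estimate how likely “character + character” and “word + character” combinations are to form a word, i.e. the internal cohesion of the word
– Example: c = “自然語言處理” (natural language processing), a = “自然語言處”, b = “然語言處理”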
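The slide does not state the MI formula. One form often used in Chinese term extraction with exactly this a/b/c notation is f(c) / (f(a) + f(b) − f(c)), where a and b are the longest proper prefix and suffix of candidate c; the sketch below assumes that form, and the three-line corpus is made up.

```python
def count_occurrences(corpus, s):
    """Total (non-overlapping) occurrences of substring s in all lines."""
    return sum(line.count(s) for line in corpus)


def cohesion(c, corpus):
    """Assumed association measure: f(c) / (f(a) + f(b) - f(c)).

    a = c without its last character, b = c without its first character.
    Values near 1 mean a and b almost never occur outside c.
    """
    a, b = c[:-1], c[1:]
    fa, fb, fc = (count_occurrences(corpus, s) for s in (a, b, c))
    denominator = fa + fb - fc
    return fc / denominator if denominator else 0.0


# Hypothetical mini-corpus.
corpus = ["自然語言處理", "自然語言處理", "自然語言"]
full_term = cohesion("自然語言處理", corpus)  # a and b occur only inside c
fragment = cohesion("語言處", corpus)         # "語言" also occurs elsewhere
```

In this toy corpus the complete term scores 1.0 while the fragment "語言處" scores 2/3, which is the kind of contrast used to decide where a term's boundaries lie.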

Page 18: Multimedia-text team report_2015-07-31

Method
• Adaptation

Raw text: 目前此車铣設備由绮發機械提供

Existing segmenters (CKIP, Stanford, jieba, ...):
目前 此車 铣 設備 由 绮 發 機械 提供
b e b e s b e s s s b e b e

Manual adjustment:
目前 此 車铣 設備 由 绮發機械 提供
b e s b e b e s b m m e b e

The manually adjusted data is used to train the CRF-based DELTA word segmentor.

Input: L 固定 板會 有 擺動 過大 疑慮

Output: L 固定板 會 有 擺動 過大 疑慮
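The b/m/e/s labels above mark each character as the beginning, middle, or end of a word, or as a single-character word. A minimal sketch of converting between segmented words and these tags, which is the data format a CRF segmenter would be trained on (the function names are illustrative):

```python
def words_to_tags(words):
    """Map a list of words to one b/m/e/s tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("s")
        else:
            tags.extend(["b"] + ["m"] * (len(word) - 2) + ["e"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their b/m/e/s tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("e", "s"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without "e" or "s"
        words.append(current)
    return words


corrected = "目前 此 車铣 設備 由 绮發機械 提供".split()
tags = words_to_tags(corrected)
restored = tags_to_words("".join(corrected), tags)
```

Applied to the manually adjusted sentence, `tags` reproduces the sequence shown on the slide, and `restored` recovers the same word list.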

Page 19: Multimedia-text team report_2015-07-31

Thank you

Page 20: Multimedia-text team report_2015-07-31

Knowledge extraction from Delta data

莊文立

Page 21: Multimedia-text team report_2015-07-31

Information Extraction

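• Named Entity Recognition (NER)
– Identify and classify proper nouns
• Companies, people, products, locations, etc.

• Relation Extraction (RE)
– Find the relations between named entities in text, for example
• Competition
• Cooperation
• Customer
• Upstream supplier
– Usually represented as (subject, relation, object) triples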

Page 22: Multimedia-text team report_2015-07-31

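SALES visit record: Regarding the pricing of the BV3418 project, Manager Yao (姚經理) of 欣特協寶 responded that General Manager Zhou (周總) considers Delta's (台達) price to be higher than that of the Siemens (西門子) 808 low-end NC controller.

• NER
• 西門子 /Organization
• 欣特協寶 /Organization
• 台達 /Organization
• 姚經理 /Person
• 周總 /Person

• RE

#  Subject    Relation       Object
1  台達       COMPETE_WITH   西門子
2  台達       IS_VENDOR      欣特協寶
3  西門子     IS_VENDOR      欣特協寶
4  欣特協寶   SUBORDINATE    姚經理
5  欣特協寶   SUBORDINATE    周總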
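The extracted relations can be held as (subject, relation, object) triples and queried directly. The sketch below stores the five triples from this example; the `Triple` type and the helper function are illustrative names, not part of the team's system.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "relation", "object"])

# The five relations extracted from the sales visit record.
triples = [
    Triple("台達", "COMPETE_WITH", "西門子"),
    Triple("台達", "IS_VENDOR", "欣特協寶"),
    Triple("西門子", "IS_VENDOR", "欣特協寶"),
    Triple("欣特協寶", "SUBORDINATE", "姚經理"),
    Triple("欣特協寶", "SUBORDINATE", "周總"),
]


def subjects_of(triples, relation, obj):
    """All subjects that stand in `relation` to `obj`."""
    return [t.subject for t in triples
            if t.relation == relation and t.object == obj]


vendors = subjects_of(triples, "IS_VENDOR", "欣特協寶")
```

Here `vendors` lists both companies recorded as vendors of 欣特協寶, which is the kind of lookup a knowledge base built from visit records would support.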

Page 23: Multimedia-text team report_2015-07-31

Named Entity Recognition

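• Data processing
– Chinese requires good word segmentation results
– Manual labeling

• Model: Conditional Random Fields (CRF)
– Learn the usage patterns of proper nouns from the features of each word
• The word itself and its POS
• The words and POS tags in its context
• Syntactic parse tree
• Collocations
• Titles and surnames
• Proper noun database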
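A sketch of how several of the listed features could be turned into a per-token feature dictionary, the usual input shape for CRF toolkits. The title, surname, and gazetteer lists and the POS tags in the example are hypothetical, and the parse-tree and collocation features are left out for brevity.

```python
# Hypothetical resources for illustration only.
TITLES = ("經理", "總")                      # job-title suffixes
SURNAMES = {"姚", "周"}                      # common surnames
GAZETTEER = {"西門子", "台達", "欣特協寶"}    # known proper nouns


def token_features(tokens, i):
    """Features for the i-th (word, pos) pair in a segmented sentence."""
    word, pos = tokens[i]
    feats = {
        "word": word,
        "pos": pos,
        "has_title": word.endswith(TITLES),
        "starts_with_surname": word[0] in SURNAMES,
        "in_gazetteer": word in GAZETTEER,
    }
    # Context window of one token on each side.
    if i > 0:
        feats["prev_word"], feats["prev_pos"] = tokens[i - 1]
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_word"], feats["next_pos"] = tokens[i + 1]
    else:
        feats["EOS"] = True
    return feats


# Segmented fragment with hand-assigned, illustrative POS tags.
tokens = [("欣特協寶", "N"), ("姚經理", "N"), ("給出", "V"), ("回應", "N")]
feats = token_features(tokens, 1)
```

For 姚經理 the surname and title features both fire while the gazetteer feature does not, a combination that would suggest a Person rather than an Organization.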

Page 24: Multimedia-text team report_2015-07-31

Relation Extraction

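• Manual labeling is still required
• Deep Learning!
– Let the machine discover the most suitable representation by itself

• Recursive Neural Network
– “Climb” up along the syntactic parse tree
– Each word is represented by a matrix + a vector
• The vector represents the word's own meaning
• The matrix represents contextual information
– The vector output at the node where the two named entities meet is fed into a classifier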
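The matrix + vector representation described above resembles a matrix-vector recursive neural network; the slides do not name the exact model, so the composition rule below (each child's matrix transforms the other child's vector, followed by a tanh layer) is an assumption. The weights are hand-picked toy values, not learned parameters.

```python
import math


def matvec(M, v):
    """Matrix-vector product using plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product using plain lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def compose(left, right, W, W_M):
    """Combine two child nodes (vector, matrix) into a parent node.

    Assumed rule: p = tanh(W [B a ; A b]),  P = W_M [A ; B]
    with W and W_M of shape n x 2n.
    """
    (a, A), (b, B) = left, right
    stacked_vectors = matvec(B, a) + matvec(A, b)   # length 2n
    p = [math.tanh(x) for x in matvec(W, stacked_vectors)]
    P = matmul(W_M, A + B)                          # rows of A, then rows of B
    return p, P


identity = [[1.0, 0.0], [0.0, 1.0]]
left = ([1.0, 0.0], identity)    # toy word: vector + matrix
right = ([0.0, 1.0], identity)
W = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.5]]
W_M = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
p, P = compose(left, right, W, W_M)
```

Applying `compose` bottom-up along the parse tree yields a vector at the node where the two entities meet; that vector would be the classifier's input.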

[Figure: a word vector, e.g. (1, −3, 4, …, 5), shown next to a matrix drawn as a grid of dots]

Page 25: Multimedia-text team report_2015-07-31

[Figure: recursive neural network combining vector/matrix pairs up the parse tree; the top node's output is fed to the Classifier]

Page 26: Multimedia-text team report_2015-07-31

Future work

• Cross sentence
• Cross document
• Cross language

Page 27: Multimedia-text team report_2015-07-31

Thank you