multimedia-text team report_2015-07-31

MM text team: 蔡捷恩, 莊文立, 溫鈺瑋

2015@Delta Research Center

Fully automatic F/T matrix analysis from patent data

蔡捷恩

Function/Technology Matrix

Using the keyword “autologous cell renewal therapy technology”

“The Patent-Classification Technology/Function Matrix - A Systematic Method for Design Around”, Cheng et al. Mar-2013, CSIR

Problem reduction

• detecting problem/solution pairs in a patent document

“Automatic Discovery of Technology Trends from Patent Text”, Y. Kim et al. 2009, ACM SAC

Problem term detection

• Step1. finding key frames
• Step2. feature extraction
– Unsupervised feature
– Supervised feature
• Step3. classifier training


Step1. key frame detection

• We define key frames to be “atomic noun phrases & their verb phrase expansion”
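As a rough illustration of this definition, the sketch below pulls candidate key frames out of a sentence that has already been POS-tagged. It assumes Penn Treebank style tags, treats an "atomic noun phrase" as an optional determiner followed by adjectives and nouns ending in a noun, and reads "verb phrase expansion" as prepending the verb that immediately precedes the noun phrase. Both readings, and the function name `key_frames`, are assumptions made for the example, not the rules used in the paper.

```python
def key_frames(tagged):
    """Extract noun-phrase key frames from a list of (word, POS) pairs.

    Assumption: a frame is [DT] (JJ|NN)* NN, optionally extended by the
    verb that sits directly in front of it.
    """
    frames, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":          # optional determiner
            j += 1
        k = j
        while k < n and tagged[k][1].startswith(("JJ", "NN")):
            k += 1
        # Trim trailing adjectives so the phrase ends in a noun.
        while k > j and not tagged[k - 1][1].startswith("NN"):
            k -= 1
        if k > j:
            start = i
            if i > 0 and tagged[i - 1][1].startswith("VB"):
                start = i - 1             # verb phrase expansion
            frames.append(" ".join(word for word, _ in tagged[start:k]))
            i = k
        else:
            i += 1
    return frames


sentence = [("the", "DT"), ("method", "NN"), ("solves", "VBZ"),
            ("the", "DT"), ("problem", "NN"), ("of", "IN"),
            ("overfitting", "NN")]
print(key_frames(sentence))
```

In practice the tags would come from a POS tagger, and the chunking pattern would need tuning on real patent text.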


Step2 – unsupervised feature (language model)

• The model:

Maximum likelihood estimation (MLE)
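The model formula itself is not reproduced on this slide, so the following is only a sketch of what an MLE language-model feature could look like. It assumes a unigram model whose probabilities are relative frequencies, and it scores a key frame by its log-likelihood. The n-gram order, the training corpus, and the `floor` value for unseen words are all assumptions.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """MLE unigram model: p(w) = count(w) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def log_likelihood(frame_tokens, model, floor=1e-6):
    """Log-likelihood of a key frame; unseen words get a small floor probability."""
    return sum(math.log(model.get(word, floor)) for word in frame_tokens)


model = train_unigram("a b a c".split())
print(log_likelihood(["a", "b"], model))
```

The resulting score can be used directly as one numeric feature of a key frame.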


Step2 – supervised feature (linguistic model)

• Based on part-of-speech (POS) statistics from labeled patents


Step2 – supervised feature (linguistic model)

• The model:

Delta function = 1 only when the current key frame matches the given pattern
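The delta function can be read as an indicator over POS patterns. The sketch below assumes that a pattern is an exact POS tag sequence and that one binary feature is produced per pattern. The three patterns listed are hypothetical placeholders, not the patterns derived from the labeled patents.

```python
def delta(frame_tags, pattern):
    """Return 1 only when the key frame's POS sequence equals the pattern."""
    return 1 if tuple(frame_tags) == tuple(pattern) else 0


def pattern_features(frame_tags, patterns):
    """One indicator feature per pattern."""
    return [delta(frame_tags, pattern) for pattern in patterns]


# Hypothetical patterns for illustration only.
PATTERNS = [("VBZ", "DT", "NN"), ("JJ", "NN"), ("NN",)]

print(pattern_features(["VBZ", "DT", "NN"], PATTERNS))
```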


Step3. classifier training

• Simply concatenate the features mentioned above and feed them to LIBSVM
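One way to hand the concatenated features to LIBSVM is to write them in its sparse text format, where each line is a label followed by `index:value` pairs with 1-based ascending indices. The helper below is a minimal sketch; the name `to_libsvm_line` and the sample feature values are made up for the example.

```python
def to_libsvm_line(label, *feature_groups):
    """Concatenate feature groups and render one LIBSVM-format line.

    Zero-valued features are omitted, as the sparse format allows.
    """
    features = [value for group in feature_groups for value in group]
    pairs = [f"{index}:{value:g}"
             for index, value in enumerate(features, start=1)
             if value != 0]
    return " ".join([str(label)] + pairs)


unsupervised = [-2.5]        # e.g. a language-model score
supervised = [0, 1, 0]       # e.g. POS pattern indicators
print(to_libsvm_line(1, unsupervised, supervised))
```

Lines like this, one per key frame, form a training file that the LIBSVM command-line tools can read.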


Solution term detection

• Step1. key frame detection
• Step2. feature extraction
– Unsupervised feature
– Supervised feature: based on problem terms
• Step3. classifier training


Problems

• Lack of labeled data => the linguistic model proposed in the paper seems general enough => adopt it as-is, with Porter stemming


Further improvement

• Coreference resolution
– “the method solves the problem of overfitting.”
• Semantic-based clustering
– Okapi BM25: “The Probabilistic Relevance Framework: BM25 and Beyond”, Robertson et al., 2009
– Word vector: “Efficient Estimation of Word Representations in Vector Space”, T. Mikolov, ICLR, 2013
– Document vector: “Distributed Representations of Words and Phrases and their Compositionality”, NIPS, 2013

In my opinion: Okapi BM25 > word vector > document vector
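For reference, a compact sketch of Okapi BM25 scoring is given below. It assumes tokenized documents, the common defaults `k1=1.2` and `b=0.75`, and the IDF variant with `+1` inside the logarithm (which keeps the weight positive). These choices are assumptions; the slides do not say which variant would be used for clustering.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in docs against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["patent", "claim", "method"],
        ["neural", "network", "method"],
        ["patent", "patent", "search"]]
print(bm25_scores(["patent"], docs))
```

Pairwise scores of this kind could serve as the similarity measure for grouping related problem or solution terms.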


Thank you

Chinese domain terminology extraction

溫鈺瑋

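Example

× 目前 此車 铣 設備 由 绮 發 機械 提供
目前 此 車铣 設備 由 绮發機械 提供
(“This turn-mill equipment is currently supplied by 绮發機械.”)

× L 固定板會有擺動過大疑慮
L 固定板會有擺動過大疑慮
(“There is a concern that the L fixing plate swings too much.”)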

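Method

• Collocation
– Use mutual information (MI) to estimate how likely “character + character” and “word + character” pairs are to combine into a word, i.e. the internal cohesion of a word
– Example: c = “自然語言處理”, a = “自然語言處”, b = “然語言處理”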
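The slide does not show the MI formula, so the sketch below assumes one variant that is common in Chinese term extraction: for a candidate string c with longest left substring a and longest right substring b, the cohesion is f(c) / (f(a) + f(b) - f(c)), where f is corpus frequency. The tiny corpus and the function names are made up for the example.

```python
from collections import Counter


def ngram_counts(corpus, max_n):
    """Count every character n-gram up to length max_n in a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts


def cohesion(c, counts):
    """Assumed MI variant: f(c) / (f(a) + f(b) - f(c))."""
    a, b = c[:-1], c[1:]          # longest left and right substrings
    f_c, f_a, f_b = counts[c], counts[a], counts[b]
    denominator = f_a + f_b - f_c
    return f_c / denominator if denominator else 0.0


corpus = ["自然語言處理很有趣", "我喜歡自然語言處理", "自然語言處很少單獨出現"]
counts = ngram_counts(corpus, max_n=6)
print(cohesion("自然語言處理", counts))
```

A score close to 1 means a and b almost never occur outside c, which suggests that c is a complete term.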

Method

• Adaptation

Input: 目前此車铣設備由绮發機械提供

→ CKIP, Stanford, jieba…: b e b e s b e s s s b e b e

→ Manual adjustment: b e s b e b e s b m m e b e

→ CRF-based DELTA word segmentor

Input : L 固定板會有擺動過大疑慮 (“There is a concern that the L fixing plate swings too much.”)

Output : L 固定板會有擺動過大疑慮
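The b/m/e/s labels in the adaptation step mark each character as the beginning, middle, or end of a word, or as a single-character word. The sketch below converts between tag sequences and word lists, using the two tag sequences from the slide. The function names are illustrative.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character b/m/e/s tags."""
    words, current = [], ""
    for char, tag in zip(chars, tags):
        current += char
        if tag in ("e", "s"):      # a word closes on 'e' or 's'
            words.append(current)
            current = ""
    if current:                    # tolerate a sequence that ends mid-word
        words.append(current)
    return words


def words_to_tags(words):
    """Produce per-character b/m/e/s tags from a segmented sentence."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("s")
        else:
            tags.extend(["b"] + ["m"] * (len(word) - 2) + ["e"])
    return tags


sentence = "目前此車铣設備由绮發機械提供"
generic = "b e b e s b e s s s b e b e".split()
adjusted = "b e s b e b e s b m m e b e".split()
print(tags_to_words(sentence, generic))
print(tags_to_words(sentence, adjusted))
```

The manually adjusted tag sequences are the kind of per-character labels a CRF segmenter is trained on.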

Thank you

Knowledge extraction from Delta data

莊文立

Information Extraction

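• Named Entity Recognition (NER)
– Identify and classify proper nouns
  • Companies, people, products, locations, etc.
• Relation Extraction (RE)
– Find the relations between named entities in text, for example
  • Competition
  • Cooperation
  • Customer
  • Upstream supplier
– Usually represented as (subject, relation, object) triples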

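Sales visit record: 對於 BV3418 專案價格的了解，欣特協寶姚經理給出的回應是，周總認為，台達的價格比西門子 808 低階機種 NC 控制器的價格高。

(“Regarding pricing for the BV3418 project, Manager Yao (姚經理) of 欣特協寶 responded that Mr. Zhou (周總) considers Delta’s (台達) price higher than that of the Siemens (西門子) 808 low-end NC controller.”)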

• NER
  • 西門子 (Siemens) /Organization
  • 欣特協寶 /Organization
  • 台達 (Delta) /Organization
  • 姚經理 /Person
  • 周總 /Person

• RE

# Subject Relation Object
1 台達 COMPETE_WITH 西門子
2 台達 IS_VENDOR 欣特協寶
3 西門子 IS_VENDOR 欣特協寶
4 欣特協寶 SUBORDINATE 姚經理
5 欣特協寶 SUBORDINATE 周總
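The extracted relations can be held as plain (subject, relation, object) triples and queried directly. The sketch below stores the five triples from the table; the `Triple` type and the `subjects_of` helper are illustrative names, not part of the team's system.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


TRIPLES = [
    Triple("台達", "COMPETE_WITH", "西門子"),
    Triple("台達", "IS_VENDOR", "欣特協寶"),
    Triple("西門子", "IS_VENDOR", "欣特協寶"),
    Triple("欣特協寶", "SUBORDINATE", "姚經理"),
    Triple("欣特協寶", "SUBORDINATE", "周總"),
]


def subjects_of(relation: str, obj: str, triples: List[Triple]) -> List[str]:
    """All subjects that stand in the given relation to obj."""
    return [t.subject for t in triples if t.relation == relation and t.obj == obj]


# Which organizations are vendors of 欣特協寶?
print(subjects_of("IS_VENDOR", "欣特協寶", TRIPLES))
```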

Named Entity Recognition

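• Data processing
– Chinese text needs good word segmentation first
– Manual annotation
• Model: Conditional Random Fields (CRF)
– Learn the usage patterns of proper nouns from the features of each token
  • The token itself and its part of speech
  • Surrounding words and their parts of speech
  • Syntactic parse tree
  • Collocations
  • Titles and surnames
  • Proper-noun database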
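To make the feature list concrete, here is a minimal sketch of a per-token feature function that produces one dictionary per token, a layout that CRF toolkits such as sklearn-crfsuite accept. The surname, title, and proper-noun sets are tiny hypothetical stand-ins, the POS tags are placeholders, and the parse-tree and collocation features are left out for brevity.

```python
# Hypothetical lookup resources, for illustration only.
SURNAMES = {"姚", "周"}
TITLES = {"經理", "總"}
PROPER_NOUNS = {"台達", "西門子", "欣特協寶"}


def token_features(sentence, i):
    """Features for token i of a sentence given as (word, POS) pairs."""
    word, pos = sentence[i]
    features = {
        "word": word,
        "pos": pos,
        "starts_with_surname": word[0] in SURNAMES,
        "ends_with_title": any(word.endswith(title) for title in TITLES),
        "in_proper_noun_db": word in PROPER_NOUNS,
    }
    if i > 0:
        features["prev_word"], features["prev_pos"] = sentence[i - 1]
    else:
        features["BOS"] = True     # beginning of sentence
    if i < len(sentence) - 1:
        features["next_word"], features["next_pos"] = sentence[i + 1]
    else:
        features["EOS"] = True     # end of sentence
    return features


# POS tags below are placeholders, not real tagger output.
sentence = [("欣特協寶", "N"), ("姚經理", "N"), ("給出", "V"), ("回應", "N")]
print(token_features(sentence, 1))
```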

Relation Extraction

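• Manual annotation is still required
• Deep Learning!
– Let the machine discover the most suitable representation on its own
• Recursive Neural Network
– “Climb” up the syntactic parse tree
– Each word is represented by a matrix + a vector
  • The vector represents the word's own meaning
  • The matrix represents contextual information
– The vector output at the node where the two named entities meet is fed into a classifier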

[Figure: vector and matrix representations composed up the parse tree, with the resulting vector passed to a Classifier]
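The slides do not give the composition rule, so the sketch below assumes the one used in matrix-vector recursive networks: two children (a, A) and (b, B) are merged into a parent vector p = tanh(W [Ba; Ab]) and a parent matrix P = W_M [A; B]. The weights here are hand-picked toy values rather than trained parameters, and the final classifier is omitted.

```python
import math


def matvec(M, v):
    """Matrix-vector product on plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product on plain lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def compose(a, A, b, B, W, W_M):
    """Merge two children into a parent (vector, matrix) pair.

    Each child's vector is first transformed by its sibling's matrix.
    W and W_M are n x 2n; A + B stacks the rows of A and B (2n x n).
    """
    mixed = matvec(B, a) + matvec(A, b)           # concatenation, length 2n
    p = [math.tanh(x) for x in matvec(W, mixed)]  # parent vector
    P = matmul(W_M, A + B)                        # parent matrix
    return p, P


identity = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
W_M = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
p, P = compose([1.0, 0.0], identity, [0.0, 1.0], identity, W, W_M)
print(p, P)
```

Applying `compose` bottom-up along the parse tree yields a vector at the lowest node that covers both named entities, which would then be the classifier input.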

Future work

• Cross-sentence
• Cross-document
• Cross-language

Thank you