multimedia-text team report_2015-07-31
TRANSCRIPT
MM text team: 蔡捷恩, 莊文立, 溫鈺瑋
2015@Delta Research Center
Fully automatic F/T matrix analysis from patent data
蔡捷恩
Function/Technology Matrix, using the keyword “autologous cell renewal therapy technology”
“The Patent-Classification Technology/Function Matrix - A Systematic Method for Design Around”, Cheng et al. Mar-2013, CSIR
Problem reduction
• detecting problem/solution pairs in a patent document
“Automatic Discovery of Technology Trends from Patent”, Y. Kim et al. 2009, ACMSAC
Problem term detection
• Step 1. finding key frames
• Step 2. feature extraction
  – Unsupervised feature
  – Supervised feature
• Step 3. classifier training
Step 1. key frame detection
• We define key frames to be “atomic noun phrases & their verb phrase expansion”
Step 2 – unsupervised feature (language model)
• The model:
Maximum likelihood estimation (MLE)
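A minimal sketch of this unsupervised feature, assuming a bigram language model with MLE-estimated (raw count ratio) probabilities over tokenized patent sentences; the paper's exact model may differ, and `train_bigram_mle` / `score` are hypothetical names:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate bigram probabilities by maximum likelihood (raw counts)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    def prob(w_prev, w):
        # P(w | w_prev) = count(w_prev, w) / count(w_prev)
        return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    return prob

def score(prob, tokens):
    """Average bigram probability as a language-model feature for a key frame."""
    pairs = list(zip(tokens, tokens[1:]))
    return sum(prob(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
```

The resulting score is one numeric feature per key frame, later concatenated with the supervised features.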
Step 2 – supervised feature (linguistic model)
• From part-of-speech (POS) statistics on labeled patents
Step 2 – supervised feature (linguistic model)
• The model:
Delta function = 1 only when the current key frame matches the given pattern
Step 3. classifier training
• Simply concatenate the features mentioned above and feed them to LIBSVM
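A minimal sketch of this step: the unsupervised and supervised feature groups are concatenated into one vector and written in LIBSVM's sparse text format (`to_libsvm_line` is a hypothetical helper, not from the talk):

```python
def to_libsvm_line(label, unsup_feats, sup_feats):
    """Concatenate the feature groups and emit one LIBSVM-format line:
    <label> <index>:<value> ...  (indices are 1-based, zero values skipped)."""
    feats = list(unsup_feats) + list(sup_feats)
    cells = [f"{i}:{v}" for i, v in enumerate(feats, start=1) if v != 0]
    return " ".join([str(label)] + cells)
```

The lines can then be saved to a file and passed to LIBSVM's `svm-train` as-is.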
Solution term detection
• Step 1. key frame detection
• Step 2. feature extraction
  – Unsupervised feature
  – Supervised feature: based on problem terms
• Step 3. classifier training
Problems
• Lack of labeled data => the linguistic model proposed in the paper seems general enough => trust it directly, applying Porter stemming
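A sketch of applying the paper's linguistic patterns directly after stemming; `toy_stem` here is a crude suffix stripper standing in for the Porter stemmer (a real Porter implementation should be used in practice), and `matches_pattern` is the delta-function feature from the previous slide:

```python
def toy_stem(word):
    """Crude suffix stripper, a stand-in for the Porter stemmer."""
    for suf in ("ing", "es", "ed", "e", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def matches_pattern(tokens, pattern):
    """Delta-function feature: 1 iff the stemmed key frame equals the
    stemmed pattern, so surface variants still fire the same pattern."""
    return int([toy_stem(t) for t in tokens] == [toy_stem(p) for p in pattern])
```

Stemming lets a pattern written as “solve problem” also match “solves problems” without any extra labeled data.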
Further improvement
• Coreference resolution
  – e.g. “the method solves the problem of overfitting.”
• Semantics-based clustering
  – Okapi BM25: “The Probabilistic Relevance Framework: BM25 and Beyond”, Robertson et al., 2009
  – Word vector: “Efficient Estimation of Word Representations in Vector Space”, T. Mikolov et al., ICLR, 2013
  – Document vector: “Distributed Representations of Words and Phrases and their Compositionality”, NIPS, 2013
In my opinion: Okapi > word vector > document vector
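A minimal sketch of the Okapi BM25 option, assuming tokenized documents; the parameter defaults k1 = 1.5 and b = 0.75 are common choices, not values from the talk:

```python
import math

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 relevance of `doc` to `query`, with IDF computed over `docs`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)           # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed, non-negative IDF
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Pairwise BM25 scores between term contexts could then drive the clustering.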
Thank you
Chinese Domain Term Extraction
溫鈺瑋
Examples
× (wrong)   目前 此車 铣 設備 由 绮 發 機械 提供
  (correct) 目前 此 車铣 設備 由 绮發機械 提供
× (wrong)   L 固定 板會 有 擺動 過大 疑慮
  (correct) L 固定板 會 有 擺動 過大 疑慮
Method
• Collocation
  – Use mutual information (MI) to estimate the probability that characters combine with characters or words into a word, i.e. the internal cohesion of the candidate word
  – e.g. c = “自然語言處理”, a = “自然語言處”, b = “然語言處理”
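A sketch of an MI-style cohesion score on the slide's example, assuming the candidate c is compared against its two longest substrings a = c[:-1] and b = c[1:] using corpus counts; the exact MI formula the team uses may differ:

```python
import math

def cohesion(counts, total, c):
    """PMI-style internal cohesion of candidate word c: compare its
    probability with the product of its two longest substrings'
    probabilities. Higher = stronger internal binding = more word-like."""
    a, b = c[:-1], c[1:]
    p_c = counts[c] / total
    p_a = counts[a] / total
    p_b = counts[b] / total
    return math.log(p_c / (p_a * p_b))
```

A real word like 自然語言處理 occurs almost as often as its substrings, so its cohesion is high; a random character run does not.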
Method
• Adaptation
  – Raw text: 目前此車铣設備由绮發機械提供
  – Off-the-shelf segmentor (CKIP, Stanford, jieba…): 目前 此車 铣 設備 由 绮 發 機械 提供 → tags b e b e s b e s s s b e b e
  – Manual correction: 目前 此 車铣 設備 由 绮發機械 提供 → tags b e s b e b e s b m m e b e
CRF-based DELTA word segmentor
Input : L 固定 板會 有 擺動 過大 疑慮
Output : L 固定板 會 有 擺動 過大 疑慮
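The b/m/e/s tags above (begin, middle, end, single) can be decoded back into the segmented output; a minimal sketch of that post-processing step:

```python
def bmes_to_words(chars, tags):
    """Decode a BMES tag sequence (b=begin, m=middle, e=end, s=single)
    into segmented words, as done after CRF tagging."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("e", "s"):    # a word boundary closes here
            words.append(buf)
            buf = ""
    if buf:                       # tolerate a truncated tag sequence
        words.append(buf)
    return words
```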
Thank you
Knowledge Extraction from Delta Data
莊文立
Information Extraction
• Named Entity Recognition (NER)
  – Identifying and classifying proper nouns
    • companies, people, products, locations, etc.
• Relation Extraction (RE)
  – Finding the relations between named entities in text, e.g.
    • competitor
    • partner
    • customer
    • upstream supplier
  – Usually represented as (subject, relation, object) triples
Sales visit record: regarding the pricing of project BV3418, 姚經理 (Manager Yao) of 欣特協寶 responded that 周總 (GM Zhou) believes 台達's (Delta's) price is higher than that of 西門子's (Siemens') low-end 808 NC controller.
• NER
  • 西門子 /Organization
  • 欣特協寶 /Organization
  • 台達 /Organization
  • 姚經理 /Person
  • 周總 /Person
• RE
  #  Subject    Relation       Object
  1  台達       COMPETE_WITH   西門子
  2  台達       IS_VENDOR      欣特協寶
  3  西門子     IS_VENDOR      欣特協寶
  4  欣特協寶   SUBORDINATE    姚經理
  5  欣特協寶   SUBORDINATE    周總
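The extracted triples can be stored and queried directly; a minimal sketch using the table above (`Triple` and `competitors_of` are illustrative names, not from the talk):

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """(subject, relation, object) representation of one extracted relation."""
    subject: str
    relation: str
    object: str

triples = [
    Triple("台達", "COMPETE_WITH", "西門子"),
    Triple("台達", "IS_VENDOR", "欣特協寶"),
    Triple("西門子", "IS_VENDOR", "欣特協寶"),
    Triple("欣特協寶", "SUBORDINATE", "姚經理"),
    Triple("欣特協寶", "SUBORDINATE", "周總"),
]

def competitors_of(company):
    """Query the triple store for COMPETE_WITH relations of a company."""
    return [t.object for t in triples
            if t.subject == company and t.relation == "COMPETE_WITH"]
```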
Named Entity Recognition
• Data processing
  – Chinese requires good word segmentation
  – Manual labeling
• Model: Conditional Random Fields (CRF)
  – Learns the usage patterns of proper nouns from per-token features
    • the token itself and its POS
    • context tokens and their POS
    • parse tree
    • collocations
    • titles and surnames
    • proper-noun databases (gazetteers)
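A sketch of a per-token feature extractor covering several of the feature types listed above (token, POS, context, surname, gazetteer); the lexicons here are tiny stand-ins for the real resources, and the feature dict would be fed to a CRF toolkit:

```python
SURNAMES = {"姚", "周"}            # stand-in for a real surname lexicon
GAZETTEER = {"台達", "西門子"}     # stand-in for a proper-noun database

def token_features(tokens, pos_tags, i):
    """Feature dict for token i, for CRF-based NER: the token itself, its
    POS, the surrounding context, a surname hint, and gazetteer membership."""
    return {
        "word": tokens[i],
        "pos": pos_tags[i],
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        "starts_with_surname": tokens[i][:1] in SURNAMES,
        "in_gazetteer": tokens[i] in GAZETTEER,
    }
```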
Relation Extraction
• Still needs manual labeling
• Deep learning!
  – Let the machine discover the most suitable representation by itself
• Recursive Neural Network
  – “Climbs” up the parse tree
  – Each word is represented by a matrix + a vector
    • the vector encodes the word's own meaning
    • the matrix encodes how it modifies its context
  – The vector output at the node where the two named entities meet is fed into a classifier
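A sketch of one matrix + vector composition step in the style of a matrix-vector recursive network: each word's matrix transforms its neighbour's vector, and the parent vector is a nonlinearity of the concatenated results (the dimensions and weights here are toy values, not the trained model):

```python
import math

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def compose(a, A, b, B, W):
    """One composition step: words carry (vector, matrix) pairs (a, A) and
    (b, B); each matrix transforms the other word's vector, and the parent
    vector is tanh(W · [A·b ; B·a])."""
    z = matvec(A, b) + matvec(B, a)     # concatenate the two interactions
    return [math.tanh(x) for x in matvec(W, z)]
```

Repeating this bottom-up along the parse tree yields the vector at the node covering both named entities, which goes to the classifier.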
[Figure: a recursive neural network composes word (vector + matrix) representations bottom-up along the parse tree; the vector at the top node is fed into the classifier]
Future work
• Cross-sentence
• Cross-document
• Cross-language
Thank you