Address Standardization with Latent Semantic Association

Author: Honglei Guo, Huijia Zhu, Zhili Guo, XiaoXun Zhang, and Zhong Su

Publication: KDD’09

Advisor: Chia-Hui Chang

Presenter: Chia-Yi Huang

2010/08/12


Page 2: Outline

Introduction
Related Works
Latent Semantic Association Method
Address Standardization Using LaSA Model and Informative Sampling
Experiments
Conclusions

Page 3: Introduction

Motivation
Approaches
Related Works

Page 4: Introduction

Address data are highly irregular.
◦ They are typically entered by different people at different times.

Addresses should be converted to a standard, consistent format.
◦ Ex: “1101 Kitchawan Road, Route 134, Yorktown Heights, N.Y. 10598”
◦ [House No.: 1101], [Street: Kitchawan Road], [Route: Route 134], [City: Yorktown Heights], [State: N.Y.], [Zip: 10598]
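The target format above can be pictured as a list of labeled tokens collapsed into fields. The following is a minimal sketch, assuming the labels have already been assigned by hand; the function name `group_fields` and the token split are illustrative and not taken from the slides.

```python
# Hand-labeled tokens for the example address (labels follow the slide's fields).
labeled_tokens = [
    ("1101", "House No."),
    ("Kitchawan", "Street"),
    ("Road", "Street"),
    ("Route", "Route"),
    ("134", "Route"),
    ("Yorktown", "City"),
    ("Heights", "City"),
    ("N.Y.", "State"),
    ("10598", "Zip"),
]


def group_fields(tokens):
    """Collect tokens by label, keeping token order, and join each group."""
    fields = {}
    for token, label in tokens:
        fields.setdefault(label, []).append(token)
    return {label: " ".join(parts) for label, parts in fields.items()}


record = group_fields(labeled_tokens)
print(record)
```

In the approach presented here, the labels come from a learned classifier rather than from hand annotation; the sketch only shows the output shape.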

Page 5: Introduction (cont.)

Latent semantic association (LaSA)
◦ Minimizes human effort and augments the labeled training data set.

The address standardization model is learned from LaSA features and informative samples.

Page 6: Latent Semantic Association Method

Virtual Context Document
Learning LaSA Model

Page 7: Latent Semantic Association Model

To minimize human effort, we want to use $p_s(x, y)$ to approximate $p_t(x, y)$.
◦ $X$: feature space representing word instances.
◦ $Y$: set of semantic labels.
◦ $p_s(x, y)$, $p_t(x, y)$: the underlying distributions of the labeled training data set and the target data set, respectively.

A LaSA model $\theta_{s,t}$ captures latent semantic associations among words from the unlabeled domain data.
◦ Better augments the training data set.
◦ Enhances the estimate of the distribution so that it better approximates the real domain distribution.

Page 8: Learning LaSA Model from Virtual Context Documents

Virtual Context Document
◦ Given a word $x_i$, the virtual context document of $x_i$ is $vd_{x_i} = \{F(x_i^{s_1}), \ldots, F(x_i^{s_n})\}$
◦ $F(x_i^{s_k})$: context feature set of $x_i$ in the address sample $s_k$, $1 \le k \le n$.
◦ $n$: total number of samples in the corpus that contain $x_i$.
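The construction of a virtual context document can be sketched as below. This is an illustration under assumptions: the context feature set is taken to be the neighboring words tagged with their relative offset, the helper names are invented here, and the second address sample is made up.

```python
from collections import defaultdict


def context_features(tokens, i, window=3):
    """Context features of tokens[i]: neighbors tagged with their relative offset."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats.append(f"w[{off:+d}]={tokens[j]}")
    return feats


def build_virtual_documents(samples, window=3):
    """Pool, for every word, its context features over all samples containing it."""
    docs = defaultdict(list)
    for tokens in samples:
        for i, word in enumerate(tokens):
            docs[word].extend(context_features(tokens, i, window))
    return dict(docs)


samples = [
    ["1101", "Kitchawan", "Road", "Yorktown", "Heights"],
    ["22", "Main", "Road", "Albany"],  # hypothetical second sample
]
vdocs = build_virtual_documents(samples, window=1)
print(vdocs["Road"])
```

Each word ends up with one pooled document regardless of how many samples it occurs in, which is what lets a topic model treat words as documents.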

Page 9: Learning LaSA Model from Virtual Context Documents (cont.)

Given $vd_{x_i} = \{f_1, \ldots, f_j, \ldots, f_m\}$, each feature is weighted by

$Weight(f_j, x_i) = \log_2 \dfrac{P(f_j, x_i)}{P(f_j)\,P(x_i)}$
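The weight is a pointwise mutual information score. A minimal sketch follows, assuming the probabilities are estimated from raw feature-word co-occurrence counts (the slides do not say how they are estimated); `pmi_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def pmi_weights(virtual_docs):
    """Return {(feature, word): log2(P(f, x) / (P(f) * P(x)))} from counts."""
    pair_counts = Counter()
    feat_counts = Counter()
    word_counts = Counter()
    for word, feats in virtual_docs.items():
        for f in feats:
            pair_counts[(f, word)] += 1
            feat_counts[f] += 1
            word_counts[word] += 1
    total = sum(pair_counts.values())
    weights = {}
    for (f, word), c in pair_counts.items():
        p_joint = c / total
        p_f = feat_counts[f] / total
        p_x = word_counts[word] / total
        weights[(f, word)] = math.log2(p_joint / (p_f * p_x))
    return weights


# Toy virtual context documents (made-up features).
toy_docs = {
    "Road": ["prev=Main", "next=Albany"],
    "Street": ["prev=Main", "next=Boston"],
}
w = pmi_weights(toy_docs)
print(w[("prev=Main", "Road")], w[("next=Albany", "Road")])
```

A feature shared equally by both words carries no information (weight 0), while a feature seen with only one word gets a positive weight.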

Page 10: Learning LaSA Model

Latent Dirichlet allocation (LDA) imposes a Dirichlet distribution on the topic mixture weights corresponding to the documents in the corpus.
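For reference, the standard LDA generative process is written out below. It assumes that each virtual context document plays the role of a document $d$ and each context feature plays the role of a word, which is an interpretation of the preceding slides; $\alpha$, $\beta$, $\theta_d$, and $z_{d,n}$ are conventional LDA notation rather than symbols from the slides.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture weights of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the $n$th feature in } d \\
f_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}) && \text{observed context feature}
\end{align*}
```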

Page 11: Learning LaSA Model (cont.)

Page 12: Address Standardization Using LaSA Model and Informative Sampling

RRM Classifier
Latent Semantic Association Feature
Informative Sampling

Page 13: Address Standardization Using LaSA Model

Address standardization is viewed as a sequential classification problem.
◦ Employs a Robust Risk Minimization (RRM) classifier.

Latent Semantic Association Feature
◦ Frequency: 10
◦ Number of topics N: 50
◦ Context view window size: {-3, 3}
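A per-token feature map for such a sequential classifier might look like the sketch below. It assumes that a trained LaSA model assigns each known word a single topic id; the slides do not spell out this mapping, so `word_topic`, its topic ids, and the feature names are all hypothetical. The RRM classifier itself is not shown.

```python
def token_features(tokens, i, word_topic, window=3):
    """Features for tokens[i]: the word, its neighbors in [-window, window],
    and the (assumed) LaSA topic id of each word that has one."""
    feats = {"word": tokens[i]}
    topic = word_topic.get(tokens[i])
    if topic is not None:
        feats["lasa"] = topic
    for off in range(-window, window + 1):
        j = i + off
        if off == 0 or not (0 <= j < len(tokens)):
            continue
        feats[f"word[{off:+d}]"] = tokens[j]
        neighbor_topic = word_topic.get(tokens[j])
        if neighbor_topic is not None:
            feats[f"lasa[{off:+d}]"] = neighbor_topic
    return feats


word_topic = {"Road": 7, "Street": 7, "Heights": 12}  # hypothetical topic ids
tokens = ["1101", "Kitchawan", "Road", "Yorktown", "Heights"]
feats = token_features(tokens, 1, word_topic)
print(feats)
```

Because "Road" and "Street" share a topic id in this toy mapping, a classifier trained on one can generalize to the other, which is the intuition behind adding the LaSA feature.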

Page 14: Informative Sampling

The informative sample selection method uses a variant of uncertainty sampling.

The more uncertain fragments a sample contains, the more informative it is.

Given an address sample $S_i = \{tok_j\}_{j=1}^{N}$,
◦ $tok_j$: the $j$th token unit in $S_i$

Confidence score of $S_i$:
◦ $Score(tok_j)$: confidence score of $tok_j$ in $S_i$
◦ $TokNum(S_i)$: total number of token units in $S_i$
◦ $UncNum(S_i)$: the number of uncertain units in $S_i$

Token units with a low confidence score (i.e., $Score(tok_j) \le \alpha$) are considered uncertain units.
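The selection idea can be sketched as follows. The exact sample-level confidence formula is not reproduced in this transcript, so the sketch ranks samples by the fraction UncNum/TokNum as an assumed stand-in; the threshold value, scores, and function names are made up for illustration, and every sample is assumed to have at least one token.

```python
def uncertain_count(scores, alpha):
    """UncNum: number of token units whose confidence score is <= alpha."""
    return sum(1 for s in scores if s <= alpha)


def rank_by_uncertainty(samples, alpha=0.5):
    """Order sample ids so that the largest uncertain fraction comes first."""
    def uncertain_fraction(item):
        _, scores = item
        return uncertain_count(scores, alpha) / len(scores)

    ranked = sorted(samples.items(), key=uncertain_fraction, reverse=True)
    return [sample_id for sample_id, _ in ranked]


# Hypothetical per-token confidence scores from the current classifier.
token_scores = {
    "s1": [0.95, 0.90, 0.99],
    "s2": [0.40, 0.85, 0.30, 0.90],
    "s3": [0.45, 0.97, 0.92, 0.88],
}
order = rank_by_uncertainty(token_scores, alpha=0.5)
print(order)
```

The samples at the front of the ranking would be sent for annotation first, since they contain the most tokens the classifier is unsure about.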

Page 15: Informative Sampling (cont.)

Page 16: Experiments

Data set

Page 17: Experiments (cont.)

Performance Enhancement by LaSA Model
◦ Relative F-measure enhancement
◦ Relative error reduction
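The slide names the two measures without giving formulas. The conventional definitions are sketched below, assuming that error is taken as $1 - F$; $F_{\mathrm{base}}$ (baseline model) and $F_{\mathrm{LaSA}}$ (model with LaSA features) are notation introduced here.

```latex
\begin{align*}
\text{Relative F-measure enhancement} &= \frac{F_{\mathrm{LaSA}} - F_{\mathrm{base}}}{F_{\mathrm{base}}} \\[4pt]
\text{Relative error reduction} &= \frac{(1 - F_{\mathrm{base}}) - (1 - F_{\mathrm{LaSA}})}{1 - F_{\mathrm{base}}}
\end{align*}
```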

Page 18: Experiments (cont.)

Training Data Reduction by LaSA Feature

Page 19: Cumulative Impact of LaSA Model and Informative Sampling

Page 20: Cumulative Impact of LaSA Model and Informative Sampling (cont.)

Page 21: Conclusions

The LaSA-Info method achieves more than a 45% reduction in error over the state-of-the-art RRM classifier trained on the same material.

Compared to the supervised learning method, the present approach requires only 5% as much annotated data to achieve the same level of performance.