Alias Detection in Link Data Sets, Master’s Thesis, Paul Hsiung

Post on 20-Dec-2015


Page 1: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung

Alias Detection in Link Data Sets

Master’s Thesis

Paul Hsiung

Page 2

Alias Definition

Alias of names
  – Dubya = G.W. Bush
  – Usama = Osama
  – G.W. Bush = the President
  – Osama bin Laden = the Emir, the Prince
Misspelled words
  – Unintentional (typos)
  – Intentional: mortgage = m0rtg@ge (spam)

Page 3

In What Context Do Aliases Occur?

Newspaper articles
Web pages
Spam emails
Any collection of text

Page 4

Link Data Set

A way to represent the context
Composed of a set of names and a set of links
  – Names are extracted from the text
  – Names can refer to the same entity (“Dubya” and “G.W. Bush”)
  – Links are collections of names and represent a relationship between names

Page 5

Example

Wanted al-Qaeda terror network chief Osama bin Laden and his top aide, Ayman al-Zawahri, have moved out of Pakistan and are believed to have crossed the mountainous border back into Afghanistan.

(Osama bin Laden, Ayman al-Zawahri, al-Qaeda)
(Pakistan, Osama bin Laden)
(Afghanistan, Osama bin Laden)
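A minimal Python sketch of how the three links in this example might be held in memory. It assumes each link is stored as an unordered set of names; the variable names `links` and `names` are illustrative and do not come from the thesis.

```python
# Each link is an unordered collection of names that co-occur in one piece of text.
links = [
    frozenset({"Osama bin Laden", "Ayman al-Zawahri", "al-Qaeda"}),
    frozenset({"Pakistan", "Osama bin Laden"}),
    frozenset({"Afghanistan", "Osama bin Laden"}),
]

# The set of names is the union of every name that appears in any link.
names = set().union(*links)

print(sorted(names))
print(len(links), "links,", len(names), "names")
```

Storing links as sets reflects the slide's definition of a link as a collection of names with no ordering among them.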

Page 6

Graph Representation

[Figure: graph of the example links, with nodes Osama, al-Qaeda, Ayman, Pakistan, and Afghanistan]
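One way to derive such a graph from a link data set, sketched in Python. It assumes an edge is drawn between every pair of names that share a link; the slide's figure does not survive in this transcript, so that edge rule and the helper name `build_graph` are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(links):
    """Return an adjacency map: name -> set of names sharing at least one link."""
    graph = defaultdict(set)
    for link in links:
        # Connect every pair of names that appear together in this link.
        for a, b in combinations(sorted(link), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


links = [
    {"Osama", "Ayman", "al-Qaeda"},
    {"Pakistan", "Osama"},
    {"Afghanistan", "Osama"},
]
graph = build_graph(links)
print(sorted(graph["Osama"]))
```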

Page 7

Advantages

Link data sets are easily understood by computers
They mimic the way intelligence communities gather data

Page 8

Alias Detection

Given two names in a link data set, are they aliases (i.e., do they refer to the same entity)?
How to measure their alias-ness?
Semi-supervised learning

Page 9

Orthographic Measures

String edit distance (SED)
  – Minimum number of insertions, deletions, and substitutions required to transform one name into the other
  – SED(Osama, Usama) = 2
  – SED(Osama, Bush) = 7
  – Intuitive measure
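A short dynamic-programming sketch of string edit distance in Python. The two values on this slide are reproduced when a substitution is charged as a deletion plus an insertion (cost 2); with unit-cost substitutions the same pairs would score 1 and 5. Whether the thesis uses exactly this costing is an assumption, so `sub_cost` is left as a parameter.

```python
def string_edit_distance(a, b, sub_cost=2):
    """Edit distance between a and b; insertions and deletions cost 1 each."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else sub_cost)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


print(string_edit_distance("Osama", "Usama"))  # 2
print(string_edit_distance("Osama", "Bush"))   # 7
```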

Page 10

Some Orthographic Measures

String edit distance
Normalized string edit distance
Discretized string edit distance

Page 11

Semantic Measures

But what about aliases such as the Prince and Osama?
Define the friends of Osama as the names that occur in the same links as Osama
Through link data sets, the number of occurrences of each friend can be collected
Intuition: friends of the Prince look like friends of Osama
Treat friends lists as probability vectors
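A small Python sketch of collecting a name's friends and turning the counts into a probability vector. The toy links and the helper names `friend_counts` and `to_probabilities` are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter


def friend_counts(name, links):
    """Count how often each other name shares a link with `name`."""
    counts = Counter()
    for link in links:
        if name in link:
            counts.update(n for n in link if n != name)
    return counts


def to_probabilities(counts):
    """Normalize raw co-occurrence counts so that they sum to 1."""
    total = sum(counts.values())
    return {friend: c / total for friend, c in counts.items()} if total else {}


# Hypothetical toy links, not taken from any of the thesis data sets.
links = [
    ("Osama", "al-Qaeda"),
    ("Osama", "al-Qaeda", "CNN"),
    ("Osama", "Islam"),
    ("the Prince", "al-Qaeda"),
]
counts = friend_counts("Osama", links)
probs = to_probabilities(counts)
print(counts)
print(probs)
```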

Page 12

Example of Friends

[Figure: Osama linked to its friends, labeled with co-occurrence counts: al-Qaeda (10), Islam (5), CNN (2)]

Page 13

Comparing Two Friends Lists

Friend      Osama   The Prince
al-Qaeda    10      2
Islam       5       -
Music       -       50
CNN         2       8

Page 14

Some Semantic Measures

Dot Product: 10 * 2 + 2 * 8 = 36
Normalized Dot Product
Common Friends: 2 (CNN, al-Qaeda)
KL Distance
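The four measures can be sketched in Python using the friend counts from the comparison of Osama and the Prince. The slide gives no formulas for the normalized dot product or the KL distance, so the cosine normalization and the smoothed Kullback-Leibler divergence below are assumptions, as is the smoothing constant `eps`.

```python
import math

osama = {"al-Qaeda": 10, "Islam": 5, "CNN": 2}
prince = {"al-Qaeda": 2, "Music": 50, "CNN": 8}


def dot_product(u, v):
    """Sum of count products over the friends the two names share."""
    return sum(count * v[friend] for friend, count in u.items() if friend in v)


def normalized_dot_product(u, v):
    """Cosine similarity: one common normalization, assumed here."""
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot_product(u, v) / (norm_u * norm_v) if norm_u and norm_v else 0.0


def common_friends(u, v):
    """Number of friends that appear in both lists."""
    return len(u.keys() & v.keys())


def kl_divergence(u, v, eps=1e-6):
    """KL(P || Q) over the union of friends; eps smoothing avoids log(0)."""
    friends = u.keys() | v.keys()
    total_u = sum(u.values()) + eps * len(friends)
    total_v = sum(v.values()) + eps * len(friends)
    kl = 0.0
    for f in friends:
        p = (u.get(f, 0) + eps) / total_u
        q = (v.get(f, 0) + eps) / total_v
        kl += p * math.log(p / q)
    return kl


print(dot_product(osama, prince))      # 36
print(common_friends(osama, prince))   # 2
print(normalized_dot_product(osama, prince))
print(kl_divergence(osama, prince))
```

Note that KL divergence is not symmetric, so the order of the two names matters under this assumed definition.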

Page 15

Classifier

So we have a link data set
We have some measures of what aliases are
We can easily hand-pick some examples of aliases
Let’s build a classifier!

Page 16

Classifier Training Set

Positive examples: hand-picked pairs of names in the link data set that are known aliases
Negative examples: randomly picked pairs of names from the same link data set
Calculate the measures for all pairs and insert them as attributes into the training set

Page 17

Classifier Example:

Page 18

Classifier: Cross-Validation

Experimented with Decision Trees, k-Nearest Neighbors, Naïve Bayes, Support Vector Machines, and Logistic Regression
Logistic Regression performed the best

Page 19

Prediction

Given a query name in the link data set with known aliases
Pair the query name with ALL other names
Calculate attributes for all pairs
Run each pair through the classifier and obtain a score (how likely are they to be aliases?)
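The prediction steps can be sketched in Python as follows. A logistic model is used because the deck reports that logistic regression performed best, but the weight, the bias, and the single toy measure below are hypothetical placeholders rather than fitted values.

```python
import math


def rank_candidates(query, names, measures, weights, bias):
    """Score every (query, other) pair with a logistic model; best score first."""
    scored = []
    for other in names:
        if other == query:
            continue
        attributes = [m(query, other) for m in measures]
        z = bias + sum(w * x for w, x in zip(weights, attributes))
        score = 1.0 / (1.0 + math.exp(-z))  # probability-like alias score
        scored.append((score, other))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def mismatches(a, b):
    """Toy orthographic measure: differing positions plus the length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


names = ["Osama", "Usama", "Bob"]
ranked = rank_candidates("Osama", names, [mismatches], weights=[-1.0], bias=2.0)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```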

Page 20

Example

Page 21

Prediction

Use the score to sort the pairs from most likely to be an alias to least likely

See where the true aliases lie in the sorted list and produce a ROC curve

Evaluate classifier based on ROC curve

Page 22

Summary

[Figure: pipeline diagram. Training: true alias pairs (excluding the query name) and random pairs → Calc Attributes → Train Logistic Regression. Prediction: query name → Calc Attributes → Run Classifier → ROC curve]

Page 23

ROC Curve

Start from (0,0) on the graph
Go down the sorted list
If the name on the list is a true alias, move y by one unit
If the name on the list is not a true alias, move x by one unit
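The walk described above translates directly into a few lines of Python. The input is assumed to be the sorted list reduced to True/False flags (True for a true alias); the function name `roc_points` and the two sample rankings are illustrative.

```python
def roc_points(is_alias_sorted):
    """Walk a ranked list of True/False labels and return the ROC path."""
    x, y = 0, 0
    points = [(x, y)]
    for is_alias in is_alias_sorted:
        if is_alias:
            y += 1  # true alias: move up
        else:
            x += 1  # not an alias: move right
        points.append((x, y))
    return points


# All true aliases ranked first, then a ranking with aliases interleaved.
print(roc_points([True, True, True, False, False, False]))
print(roc_points([True, False, True, False, False, True]))
```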

Page 24

Perfect ROC Example

[Figure: ROC curve plotted from the positions in the table; both axes run from 0 to 3]

name1   name2        true alias?   Position
Osama   The Prince   Yes           (0,1)
Osama   Usama        Yes           (0,2)
Osama   The Emir     Yes           (0,3)
Osama   Sid          No            (1,3)
Osama   Bob          No            (2,3)
Osama   John         No            (3,3)

Page 25

ROC Example

[Figure: ROC curve plotted from the positions in the table; both axes run from 0 to 3]

name1   name2        true alias?   Position
Osama   The Prince   Yes           (0,1)
Osama   Bob          No            (1,1)
Osama   Usama        Yes           (1,2)
Osama   Sid          No            (2,2)
Osama   John         No            (3,2)
Osama   The Emir     Yes           (3,3)

Page 26

ROC: Normalize

[Figure: ROC curve with both axes normalized to run from 0 to 1]

Balances positive and negative examples
Area under curve (AUC) = 5/9
Able to average multiple curves
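A sketch of the normalized area computation in Python. For a ranking of three true aliases and three non-aliases in the order Yes, No, Yes, No, No, Yes, the step curve covers 5 of the 9 unit squares, which gives the 5/9 quoted on this slide. The function name `normalized_auc` is illustrative.

```python
from fractions import Fraction


def normalized_auc(is_alias_sorted):
    """Area under the step ROC curve after scaling both axes to [0, 1]."""
    positives = sum(1 for flag in is_alias_sorted if flag)
    negatives = len(is_alias_sorted) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true alias and one non-alias")
    area = 0
    height = 0
    for flag in is_alias_sorted:
        if flag:
            height += 1
        else:
            area += height  # one unit step in x under the current height
    return Fraction(area, positives * negatives)


print(normalized_auc([True, False, True, False, False, True]))  # 5/9
```

Dividing by the product of the two class sizes is what makes curves from queries with different numbers of aliases comparable.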

Page 27

Empirical Results

Tested on one web page link data set and two spam link data sets
Aliases were hand-picked for each set

Page 28

Empirical Results

Choose an alias from the set of hand-picked aliases as the query name
Build the classifier from the other alias pairs, i.e. those that do not involve the query name
Do prediction and obtain a ROC curve
Repeat for each alias in the set of hand-picked aliases
Average all ROC curves on normalized axes
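One possible way to average several normalized ROC curves, sketched in Python. The slides do not say how the averaging is done; this sketch assumes the curves are averaged vertically, by sampling each step curve's height on a shared grid of x values. The helper names and the grid size are hypothetical.

```python
def curve_height(points, x):
    """Height of a normalized step ROC curve at horizontal position x."""
    return max(py for px, py in points if px <= x)


def average_curves(curves, grid_size=11):
    """Average the heights of several normalized curves on a shared x grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return [
        (x, sum(curve_height(c, x) for c in curves) / len(curves)) for x in grid
    ]


# Two extreme normalized curves: all aliases ranked first, all ranked last.
best = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
worst = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
average = average_curves([best, worst])
for x, y in average:
    print(f"{x:.1f} {y:.2f}")
```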

Page 29

Evaluation

We want to know how significant each group of attributes is
Train one classifier with just orthographic attributes
Train another with just semantic attributes
Train a third with both sets of attributes
Compare the ROC curves and the area under each curve (AUC)

Page 30

Terrorist Data Set

Manually extracted from public web pages
News and articles related to terrorism
Names mentioned in the articles are subjectively linked
Used 919 alias pairs for training

Page 31

Web Page Chart

Page 32

Spam Data Set

Collection of spam emails
HTML tags are filtered out
All words are converted to tokens, with white space as the boundary
Common tokens are filtered out (e.g. “the”, “a”)
Each email represents a link
Each link contains the tokens from the corresponding email

Page 33

Example

Subject: Mortgage rates as low as 2.95%
Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day to as low as2.<sppyjukbywvbqc>95% Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of dol<scarqdscpvibyw>l<sklhxmxbvdr>ars or b<skaavzibaenix>uy the <br>ho<solbbdcqoxpdxcr>me of yo<svesxhobppoy>ur dr<sxjsfyvhhejoldl>eams!<br>

Filtered to:
(mortgage, rates, low, refinance, today, save, thousands, dollars, home, dreams)
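A rough Python sketch of this filtering: strip anything that looks like a tag, split on white space, and drop common tokens. The stop-word list, the rule that discards non-alphabetic tokens, and the function name `email_to_link` are assumptions; the thesis's actual filter list is not given in the slides.

```python
import re

# Hypothetical stop-word list; the real list of common tokens is not known.
STOP_WORDS = {"the", "a", "of", "to", "or", "as", "your", "buy"}


def email_to_link(text, stop_words=STOP_WORDS):
    """Turn one email into a link: the tuple of its distinct filtered tokens."""
    text = re.sub(r"<[^>]*>", "", text)  # drop anything that looks like a tag
    tokens = []
    for raw in text.lower().split():  # white space is the token boundary
        token = raw.strip(".,!?:;%\"'()")  # trim surrounding punctuation
        if (
            token
            and token.isalpha()  # assumption: discard numbers and mixed tokens
            and token not in stop_words
            and token not in tokens
        ):
            tokens.append(token)
    return tuple(tokens)


# Shortened fragment of the obfuscated email body shown above.
sample = (
    "Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day to "
    "sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of "
    "dol<scarqdscpvibyw>l<sklhxmxbvdr>ars"
)
print(email_to_link(sample))
```

Removing the tags before tokenizing is what rejoins words such as "Refinance" that the spammer split with bogus markup.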

Page 34

Spam I Chart

Page 35

Spam II Chart

Page 36

Conclusion

Orthographic measures work well
Semantic measures are sometimes better, sometimes worse than orthographic ones
Combining them produces the best results
Future work includes adding other measures, such as phonetic string edit distance
Larger question: many aliases to many names