TRANSCRIPT
Automating the Detection of Anomalies and Trends from Text — NGDM'07 Workshop
Baltimore, MD
Michael W. Berry
Department of Electrical Engineering & Computer Science, University of Tennessee
October 11, 2007
1 Nonnegative Matrix Factorization (NNMF)
  Motivation
  Underlying Optimization Problem
  MM Method (Lee and Seung)
  Smoothing and Sparsity Constraints
  Hybrid NNMF Approach
2 Anomaly Detection in ASRS Collection
  Document Parsing and Term Weighting
  Preliminary Training
  SDM07 Contest Performance
3 NNTF Classification of Enron Email
  Corpus and Historical Events
  Discussion Tracking via PARAFAC/Tensor Factorization
  Multidimensional Data Analysis
  PARAFAC Model
4 References
NNMF Origins
NNMF (Nonnegative Matrix Factorization) can be used to approximate high-dimensional data having nonnegative components.

Lee and Seung (1999) demonstrated its use as a sum-by-parts representation of image data in order to both identify and classify image features.

Xu et al. (2003) demonstrated how NNMF-based indexing could outperform SVD-based Latent Semantic Indexing (LSI) for some information retrieval tasks.
NNMF for Image Processing
[Figure: sparse NNMF bases (Ai ≈ WHi) versus dense SVD bases (Ai ≈ UΣVi) for image data; Lee and Seung (1999)]
NNMF Analogue for Text Mining (Medlars)
[Figure: highest weighted terms in four NNMF basis vectors for the Medlars collection —
W*1: ventricular, aortic, septal, left, defect, regurgitation, ventricle, valve, cardiac, pressure;
W*2: oxygen, flow, pressure, blood, cerebral, hypothermia, fluid, venous, arterial, perfusion;
W*5: children, child, autistic, speech, group, early, visual, anxiety, emotional, autism;
W*6: kidney, marrow, dna, cells, nephrectomy, unilateral, lymphocytes, bone, thymidine, rats.]

Interpretable NNMF feature vectors; Langville et al. (2006)
Derivation
Given an m × n term-by-document (sparse) matrix X.

Compute two reduced-dimension matrices W, H so that X ≈ WH; W is m × r and H is r × n, with r ≪ n.

Optimization problem:

  min_{W,H} ‖X − WH‖²F,

subject to Wij ≥ 0 and Hij ≥ 0, ∀ i, j.

General approach: construct initial estimates for W and H and then improve them via alternating iterations.
Minimization Challenges and Formulations [Berry et al., 2007]
Local Minima: Non-convexity of the functional f(W, H) = (1/2)‖X − WH‖²F in both W and H.
Non-unique Solutions: (WD)(D⁻¹H) = WH is an equally valid nonnegative factorization for any nonnegative (and invertible) D.
Many NNMF Formulations:

Lee and Seung (2001) – information-theoretic formulation based on the Kullback-Leibler divergence of X from WH.

Guillamet, Bressan, and Vitria (2001) – diagonal weight matrix Q used (XQ ≈ WHQ) to compensate for feature redundancy (columns of W).

Wang, Jia, Hu, and Turk (2004) – constraint-based formulation using Fisher linear discriminant analysis to improve extraction of spatially localized features.

Other cost-function formulations – Hamza and Brady (2006), Dhillon and Sra (2005), Cichocki, Zdunek, and Amari (2006).
Multiplicative Method (MM)
Multiplicative update rules for W and H (Lee and Seung, 1999):

1 Initialize W and H with nonnegative values, and scale the columns of W to unit norm.
2 Iterate for each c, j, and i until convergence or after k iterations:
  1 Hcj ← Hcj (WᵀX)cj / ((WᵀWH)cj + ε)
  2 Wic ← Wic (XHᵀ)ic / ((WHHᵀ)ic + ε)
  3 Scale the columns of W to unit norm.

Setting ε = 10⁻⁹ will suffice to avoid division by zero [Shahnaz et al., 2006].
Multiplicative Method (MM) contd.
Multiplicative Update MATLAB® Code for NNMF

W = rand(m,k); % W initially random
H = rand(k,n); % H initially random
for i = 1:maxiter
    H = H .* (W'*A) ./ (W'*W*H + epsilon); % update H
    W = W .* (A*H') ./ (W*H*H' + epsilon); % update W
end
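The MATLAB loop above translates directly to NumPy; a minimal illustrative sketch (the test matrix A, the rank k, and the iteration count here are arbitrary choices, not from the slides):

```python
import numpy as np

def nnmf_mm(A, k, maxiter=200, eps=1e-9, seed=0):
    """Multiplicative-update NNMF (Lee and Seung): A (m x n) ~ W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # W initially random, nonnegative
    H = rng.random((k, n))  # H initially random, nonnegative
    for _ in range(maxiter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # update H
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # update W
    return W, H

# Usage: factor a small random nonnegative matrix and measure the relative fit.
A = np.random.default_rng(1).random((20, 30))
W, H = nnmf_mm(A, k=5)
err = np.linalg.norm(A - W @ H, 'fro') / np.linalg.norm(A, 'fro')
```

Because both updates multiply by nonnegative ratios, W and H stay nonnegative throughout, which is the point of the MM scheme.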
Lee and Seung MM Convergence
Convergence: when the MM algorithm converges to a limit point in the interior of the feasible region, the point is a stationary point. The stationary point may or may not be a local minimum. If the limit point lies on the boundary of the feasible region, one cannot determine its stationarity [Berry et al., 2007].

Several modifications have been proposed: Gonzalez and Zhang (2005) accelerated convergence somewhat, but the stationarity issue remains; Lin (2005) modified the algorithm to guarantee convergence to a stationary point; Dhillon and Sra (2005) derived update rules that incorporate weights for the importance of certain features of the approximation.
Hoyer’s Method
From neural network applications, Hoyer (2002) enforced statistical sparsity for the weight matrix H in order to enhance the parts-based data representations in the matrix W.

Mu et al. (2003) suggested a regularization approach to achieve statistical sparsity in the matrix H: point count regularization; penalize the number of nonzeros in H rather than Σij Hij.

Goal of increased sparsity (or smoothness) – a better representation of the parts or features spanned by the corpus (X) [Berry and Browne, 2005].
GD-CLS – Hybrid Approach
First use MM to compute an approximation to W for each iteration – a gradient descent (GD) optimization step.

Then compute the weight matrix H using a constrained least squares (CLS) model to penalize non-smoothness (i.e., non-sparsity) in H – a common Tikhonov regularization technique used in image processing (Prasad et al., 2003).

Convergence to a non-stationary point has been observed empirically (a proof is still needed).
GD-CLS Algorithm
1 Initialize W and H with nonnegative values, and scale the columns of W to unit norm.
2 Iterate until convergence or after k iterations:
  1 Wic ← Wic (XHᵀ)ic / ((WHHᵀ)ic + ε), for each c and i
  2 Rescale the columns of W to unit norm.
  3 Solve the constrained least squares problem

      min_{Hj} { ‖Xj − WHj‖²₂ + λ‖Hj‖²₂ },

    where the subscript j denotes the j-th column, for j = 1, …, n.

Any negative values in Hj are set to zero. The parameter λ is a regularization value that balances the reduction of the metric ‖Xj − WHj‖²₂ against the enforcement of smoothness and sparsity in H.
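The CLS step can be sketched in NumPy: every column Hj shares the same coefficient matrix W, so all columns are obtained from one solve of the regularized normal equations (WᵀW + λI)H = WᵀX, after which negative entries are set to zero. An illustrative sketch (the dimensions and λ are arbitrary):

```python
import numpy as np

def cls_step(X, W, lam):
    """CLS step of GD-CLS: min_H ||X - W H||^2 + lam*||H||^2, then clamp H >= 0.
    One solve of the regularized normal equations handles all columns at once."""
    k = W.shape[1]
    H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ X)
    return np.maximum(H, 0.0)  # any negative values in H_j are set to zero

rng = np.random.default_rng(0)
X = rng.random((40, 60))  # m x n term-by-document matrix
W = rng.random((40, 8))   # current m x k factor from the GD step
H = cls_step(X, W, lam=0.1)
```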
Two Penalty Term Formulation
Introduce smoothing on Wk (feature vectors) in addition to Hk:

  min_{W,H} { ‖X − WH‖²F + α‖W‖²F + β‖H‖²F },

where ‖·‖F is the Frobenius norm.

Constrained NNMF (CNMF) iteration:

  Hcj ← Hcj ((WᵀX)cj − βHcj) / ((WᵀWH)cj + ε)

  Wic ← Wic ((XHᵀ)ic − αWic) / ((WHHᵀ)ic + ε)
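A NumPy sketch of the CNMF sweep (illustrative: α, β, and the dimensions are arbitrary, and clamping negative numerators to zero is an added safeguard not stated on the slide):

```python
import numpy as np

def cnmf_update(X, W, H, alpha, beta, eps=1e-9):
    """One sweep of the CNMF multiplicative updates with Frobenius-norm
    penalties alpha*||W||_F^2 and beta*||H||_F^2. Negative numerators are
    clamped at zero to keep the factors nonnegative (an added safeguard)."""
    H = H * np.maximum(W.T @ X - beta * H, 0.0) / (W.T @ W @ H + eps)
    W = W * np.maximum(X @ H.T - alpha * W, 0.0) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(0)
X = rng.random((30, 50))
W, H = rng.random((30, 6)), rng.random((6, 50))
for _ in range(100):
    W, H = cnmf_update(X, W, H, alpha=0.01, beta=0.01)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Setting α = β = 0 recovers the plain MM updates; larger values trade fit for smoother, sparser factors.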
Improving Feature Interpretability
Gauging Parameters for Constrained Optimization
How sparse (or smooth) should the factors (W, H) be to produce as many interpretable features as possible?

To what extent do different norms (l1, l2, l∞) improve/degrade feature quality or span? At what cost?

Can a nonnegative feature space be built from objects in both images and text? Are there opportunities for multimodal document similarity?
Anomaly Detection (ASRS)
Classify events described by documents from the Aviation Safety Reporting System (ASRS) into 22 anomaly categories; contest from the SDM07 Text Mining Workshop.

General Text Parsing (GTP) Software Environment in C++ [Giles et al., 2003] used to parse both the ASRS training set and a combined ASRS training and test set:

  Dataset        Terms   ASRS Documents
  Training       15,722  21,519
  Training+Test  17,994  28,596 (7,077)

Global and document frequency required to be at least 2; stoplist of 493 common words used; character length of any term ∈ [2, 200].

Download information:
  GTP: http://www.cs.utk.edu/~lsi
  ASRS: http://www.cs.utk.edu/tmw07
Initialization Schematic
[Schematic: features × documents matrix H (documents 1 … n) and its filtered counterpart, with rank-k factors R and T.]
Anomaly to Feature Mapping and Scoring Schematic
[Schematic: anomaly-to-feature mapping — anomalies are extracted per feature using the k × k factors R and T and the feature × document matrices H; documents in R are associated with each anomaly.]
Training/Testing Performance (ROC Curves)
Best/worst ROC curves (False Positive Rate versus True Positive Rate):

                                         ROC Area
  Anomaly Type (Description)             Training  Contest
  22 Security Concern/Threat             .9040     .8925
  5  Incursion (collision hazard)        .8977     .8716
  4  Excursion (loss of control)         .8296     .7159
  21 Illness/Injury Event                .8201     .8172
  12 Traffic Proximity Event             .7954     .7751
  7  Altitude Deviation                  .7931     .8085
  18 Aircraft Damage/Encounter           .7250     .7261
  11 Terrain Proximity Event             .7234     .7575
  9  Speed Deviation                     .7060     .6893
  10 Uncommanded (loss of control)       .6784     .6504
  13 Weather Issue                       .6287     .6018
  2  Noncompliance (policy/proc.)        .6009     .5551
Anomaly Summarization Prototype - Sentence Ranking
Sentence rank = f(global term weights) – B. Lamb
Improving Summarization and Steering
What versus why:
Extraction of textual concepts still requires human interpretation (in the absence of ontologies or domain-specific classifications).

How can previous knowledge or experience be captured for feature matching (or pruning)?

To what extent can feature vectors be annotated for future use or as the text collection is updated? What is the cost of updating the NNMF (or similar) model?
Unresolved Modeling Issues
Parameters and dimensionality:
Further work is needed to determine the effects of alternative term-weighting schemes (for X) and choices of control parameters (e.g., α, β for CNMF).

How does document (or object) clustering change with different ranks (or features)?

How should feature vectors from competing models (Bayesian, neural nets, etc.) be compared in both interpretability and computational cost?
Email Collection
By-product of the FERC investigation of Enron (originally contained 15 million email messages).

This study used the improved corpus known as the Enron Email set, which was edited by Dr. William Cohen at CMU.

This set had over 500,000 email messages. The majority were sent in the 1999 to 2001 timeframe.
Enron Historical 1999-2001
Ongoing, problematic development of the Dabhol Power Company (DPC) in the Indian state of Maharashtra.

Deregulation of the California energy industry, which led to rolling electricity blackouts in the summer of 2000 (and subsequent investigations).

Revelation of Enron's deceptive business and accounting practices, which led to an abrupt collapse of the energy colossus in October 2001; Enron filed for bankruptcy in December 2001.
Multidimensional Data Analysis via PARAFAC

New Paradigm: Multidimensional Data Analysis

The third dimension offers more explanatory power: it uncovers new latent information and reveals subtle relationships.

Build a 3-way array such that there is a term-author matrix for each month.

[Diagram: an email graph is converted (via multilinear algebra) into a term-author matrix per month, stacked into a term-author-month array, and decomposed by nonnegative PARAFAC into a sum of rank-1 components.]
Temporal Assessment via PARAFAC
Objective
Use PARAFAC to analyze content of email communications over time
[Diagram: an email graph among authors (Alice, Bob, Carl, David, Ellen, Frank, Gary, Henk, Ingrid) is stacked into a term-author-month array and decomposed as a sum of rank-1 components.]
Mathematical Notation
Kronecker product:

  A ⊗ B = [ A11B ⋯ A1nB
            ⋮    ⋱   ⋮
            Am1B ⋯ AmnB ]

Khatri-Rao product (columnwise Kronecker):

  A ⊙ B = ( A1 ⊗ B1  ⋯  An ⊗ Bn )

Outer product:

  A1 ∘ B1 = [ A11B11 ⋯ A11Bm1
              ⋮      ⋱   ⋮
              Am1B11 ⋯ Am1Bm1 ]
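All three products are easy to check numerically; a small NumPy illustration (the matrices are arbitrary toy values). Note that the Khatri-Rao product simply selects the "matching-column" blocks of the full Kronecker product:

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)       # 3 x 2
B = np.arange(1.0, 7.0).reshape(3, 2)  # 3 x 2

# Kronecker product: every entry A_ij scales a full copy of B -> (9 x 4)
kron = np.kron(A, B)

# Khatri-Rao product: columnwise Kronecker of matrices with equal column counts -> (9 x 2)
khatri_rao = np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])])

# Outer product of the first columns -> rank-1 (3 x 3) matrix
outer = np.outer(A[:, 0], B[:, 0])
```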
PARAFAC Representations
PARAllel FACtors (Harshman, 1970)
Also known as CANDECOMP (Carroll & Chang, 1970)
Typically solved by Alternating Least Squares (ALS)
Alternative PARAFAC formulations:

  Xijk ≈ Σρ=1..r Aiρ Bjρ Ckρ   (scalar form)

  X ≈ Σρ=1..r Aρ ∘ Bρ ∘ Cρ, where X is a 3-way array (tensor)   (outer product form)

  Xk ≈ A diag(Ck:) Bᵀ, where Xk is a tensor slice   (tensor slice form)

  X^(I×JK) ≈ A(C ⊙ B)ᵀ, where X is matricized   (matrix form)
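The equivalence of these formulations can be verified numerically; an illustrative NumPy check of the matrix and tensor-slice forms against the outer-product form (random toy factors, and an assumed unfolding convention in which the columns of the I × JK matricization are ordered by (k, j) pairs to match C ⊙ B):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, r = 4, 5, 6, 3
A, B, C = rng.random((I, r)), rng.random((J, r)), rng.random((K, r))

# Outer-product form: X = sum_rho A_rho o B_rho o C_rho (an I x J x K tensor)
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Matrix form: the I x JK unfolding equals A (C kr B)^T, with unfolded
# columns ordered by (k, j) pairs to match the Khatri-Rao product C kr B
CkB = np.column_stack([np.kron(C[:, p], B[:, p]) for p in range(r)])
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
matches_matrix_form = np.allclose(X1, A @ CkB.T)

# Tensor-slice form: X_k = A diag(C_k:) B^T for each frontal slice k
matches_slice_form = np.allclose(X[:, :, 0], A @ np.diag(C[0]) @ B.T)
```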
PARAFAC (Visual) Representations
[Diagram: visual depictions of the PARAFAC model in scalar, outer product, tensor slice, and matrix forms.]
Nonnegative PARAFAC Algorithm
Adapted from (Mørup, 2005) and based on NNMF by (Lee and Seung, 2001):

  ‖X^(I×JK) − A(C ⊙ B)ᵀ‖F = ‖X^(J×IK) − B(C ⊙ A)ᵀ‖F = ‖X^(K×IJ) − C(B ⊙ A)ᵀ‖F

Minimize over A, B, C using multiplicative update rules:

  Aiρ ← Aiρ (X^(I×JK) Z)iρ / ((A ZᵀZ)iρ + ε),  Z = (C ⊙ B)

  Bjρ ← Bjρ (X^(J×IK) Z)jρ / ((B ZᵀZ)jρ + ε),  Z = (C ⊙ A)

  Ckρ ← Ckρ (X^(K×IJ) Z)kρ / ((C ZᵀZ)kρ + ε),  Z = (B ⊙ A)
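The update rules above can be sketched in NumPy (illustrative; the unfolding conventions and the toy rank-2 test tensor are assumptions of this sketch, not from the slides):

```python
import numpy as np

def khatri_rao(P, Q):
    """Columnwise Kronecker product of two matrices with equal column counts."""
    return np.column_stack([np.kron(P[:, j], Q[:, j]) for j in range(P.shape[1])])

def nn_parafac(X, r, maxiter=500, eps=1e-9, seed=0):
    """Nonnegative PARAFAC via multiplicative updates (after Morup, 2005)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = rng.random((I, r)), rng.random((J, r)), rng.random((K, r))
    # Mode-wise unfoldings with column orders matching the Khatri-Rao factors below.
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)  # columns ordered by (k, j)
    X2 = X.transpose(1, 2, 0).reshape(J, K * I)  # columns ordered by (k, i)
    X3 = X.transpose(2, 1, 0).reshape(K, J * I)  # columns ordered by (j, i)
    for _ in range(maxiter):
        Z = khatri_rao(C, B); A *= (X1 @ Z) / (A @ (Z.T @ Z) + eps)
        Z = khatri_rao(C, A); B *= (X2 @ Z) / (B @ (Z.T @ Z) + eps)
        Z = khatri_rao(B, A); C *= (X3 @ Z) / (C @ (Z.T @ Z) + eps)
    return A, B, C

# Usage: fit a small exact rank-2 nonnegative tensor.
rng = np.random.default_rng(1)
At, Bt, Ct = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
X = np.einsum('ir,jr,kr->ijk', At, Bt, Ct)
A, B, C = nn_parafac(X, r=2)
rel = np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X)
```

As with NNMF, the multiplicative ratios keep all three factors nonnegative at every iteration.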
Tensor-Generated Group Discussions
NNTF Group Discussions in 2001: 197 authors; 8 distinguishable discussions; the "Kaminski/Education" topic was previously unseen.

[Plot: normalized conversation level by month (Jan-Dec 2001) for Groups 3, 4, 10, 12, 17, 20, 21, and 25, with topic labels CA/Energy, India, Fastow Companies, Downfall, College Football, Kaminski/Education, and Downfall newsfeeds.]
Gantt Charts from PARAFAC Models

[Gantt charts: month-by-month (Jan-Dec 2001) activity of 25 groups under the NNTF/PARAFAC and PARAFAC models. Bar shading encodes conversation-level intervals [.01, 0.1), [0.1, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 1.0) (the PARAFAC chart adds [-1.0, .01)); gray = unclassified topic. Labeled topics include Downfall Newsfeeds, California Energy, College Football, Fastow Companies, India, Education (Kaminski), and Downfall.]
Day-level Analysis for PARAFAC (Three Groups)
Rank-25 tensor for 357 of the 365 days of 2001: A (69,157 × 25), B (197 × 25), C (357 × 25). Groups 3, 4, 5: [plots omitted]
Day-level Analysis for NN-PARAFAC (Three Groups)
Rank-25 tensor (best minimizer) for 357 of the 365 days of 2001: A (69,157 × 25), B (197 × 25), C (357 × 25). Groups 1, 7, 8: [plots omitted]
Day-level Analysis for NN-PARAFAC (Two Groups)
Groups 20 (California Energy) and 9 (Football) (from the C factor of the best minimizer) in a day-level analysis of 2001:

[Plot: conversation level by month (Jan-Dec 2001) for two conversations — the weekly pro & college football betting pool at Enron, and discussions about energy projects in California. Diagram: the term-author-day array decomposes into Conversation #1 + Conversation #2 + ….]
Four-way Tensor Results (Sept. 2007)
Apply NN-PARAFAC to a term-author-recipient-day array (39,573 × 197 × 197 × 357); construct a rank-25 tensor (best minimizer among 10 runs).

Goal: track more focused discussions between individuals/small groups; for example, the betting pool (football).

[Plots: conversation level by month (Jan-Dec 2001). 3-way results: the weekly pro & college football betting pool at Enron. 4-way results: from the list of 197 recipients, 3 individuals appeared one week.]
Four-way Tensor Results (Sept. 2007)
A four-way tensor may track a subconversation already found by the three-way tensor; for example, RTO (Regional Transmission Organization) discussions.

[Plots: conversation level by month (Jan-Dec 2001). 3-way results: conversation about FERC and Regional Transmission Organizations (RTOs). 4-way results: subconversation between J. Steffes and 3 other VPs on the same topic.]
NNTF Optimal Rank?
No known algorithm exists for computing the rank of a k-way array for k ≥ 3 [Kruskal, 1989].

The maximum rank is not a closed set for a given random tensor.

The maximum rank of an m × n × k tensor is unknown; one weak inequality is

  max{m, n, k} ≤ rank ≤ min{mn, mk, nk}

For our rank-25 NNTF, the size of the relative residual norm suggests we are still far from the maximum rank of the 3-way and 4-way arrays.
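The weak inequality is simple to evaluate; an illustrative helper (the 197 × 197 × 357 example shape is borrowed from the author/recipient/day modes of the Enron arrays above):

```python
def rank_bounds(m, n, k):
    """Weak bounds on the maximum rank of an m x n x k tensor:
    max{m, n, k} <= rank <= min{m*n, m*k, n*k}."""
    return max(m, n, k), min(m * n, m * k, n * k)

# Example: a 197 x 197 x 357 tensor has maximum rank between 357 and 38,809,
# so a rank-25 model sits far below even the lower bound.
lo, hi = rank_bounds(197, 197, 357)
```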
Further Reading
- M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons. Algorithms and Applications for Approximate Nonnegative Matrix Factorization. Computational Statistics & Data Analysis 52(1):155-173, 2007.
- F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons. Document Clustering Using Nonnegative Matrix Factorization. Information Processing & Management 42(2):373-386, 2006.
- M.W. Berry and M. Browne. Email Surveillance Using Nonnegative Matrix Factorization. Computational & Mathematical Organization Theory 11:249-264, 2005.
- P. Hoyer. Nonnegative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research 5:1457-1469, 2004.
Further Reading (contd.)
- J.T. Giles, L. Wo, and M.W. Berry. GTP (General Text Parser) Software for Text Mining. In Statistical Data Mining and Knowledge Discovery, CRC Press, Boca Raton, FL, 2003, pp. 455-471.
- W. Xu, X. Liu, and Y. Gong. Document Clustering Based on Nonnegative Matrix Factorization. Proceedings of SIGIR'03, Toronto, Canada, 2003, pp. 267-273.
- J.B. Kruskal. Rank, Decomposition, and Uniqueness for 3-way and N-way Arrays. In Multiway Data Analysis, Eds. R. Coppi and S. Bolasco, Elsevier, 1989, pp. 7-18.