automating the detection of anomalies and trends from text...

40
Automating the Detection of Anomalies and Trends from Text NGDM’07 Workshop Baltimore, MD Michael W. Berry Department of Electrical Engineering & Computer Science University of Tennessee October 11, 2007 1 / 40

Upload: others

Post on 24-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Automating the Detection of Anomalies andTrends from TextNGDM’07 Workshop

Baltimore, MD

Michael W. Berry

Department of Electrical Engineering & Computer ScienceUniversity of Tennessee

October 11, 2007

1 / 40

Page 2: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

1 Nonnegative Matrix Factorization (NNMF)MotivationUnderlying Optimization ProblemMM Method (Lee and Seung)Smoothing and Sparsity ConstraintsHybrid NNMF Approach

2 Anomaly Detection in ASRS CollectionDocument Parsing and Term WeightingPreliminary TrainingSDM07 Contest Performance

3 NNTF Classification of Enron EmailCorpus and Historical EventsDiscussion Tracking via PARAFAC/Tensor FactorizationMultidimensional Data AnalysisPARAFAC Model

4 References

2 / 40

Page 3: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

NNMF Origins

NNMF (Nonnegative Matrix Factorization) can be used toapproximate high-dimensional data having nonnegativecomponents.

Lee and Seung (1999) demonstrated its use as a sum-by-partsrepresentation of image data in order to both identify andclassify image features.

Xu et al. (2003) demonstrated how NNMF-based indexingcould outperform SVD-based Latent Semantic Indexing (LSI)for some information retrieval tasks.

3 / 40

Page 4: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

NNMF for Image Processing

SVD

W H

Ai

i

U ViΣ

Sparse NNMF verses Dense SVD Bases; Lee and Seung (1999)

4 / 40

Page 5: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

NNMF Analogue for Text Mining (Medlars)

0 1 2 3 4

10987654321 ventricular

aorticseptalleftdefectregurgitationventriclevalvecardiacpressure

Highest Weighted Terms in Basis Vector W*1

weight

term

0 0.5 1 1.5 2 2.5

10987654321 oxygen

flowpressurebloodcerebralhypothermiafluidvenousarterialperfusion

Highest Weighted Terms in Basis Vector W*2

weight

term

0 1 2 3 4

10987654321 children

childautisticspeechgroupearlyvisualanxietyemotionalautism

Highest Weighted Terms in Basis Vector W*5

weight

term

0 0.5 1 1.5 2 2.5

10987654321 kidney

marrowdnacellsnephrectomyunilaterallymphocytesbonethymidinerats

Highest Weighted Terms in Basis Vector W*6

weight

term

Interpretable NNMF feature vectors; Langville et al. (2006)

5 / 40

Page 6: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Derivation

Given an m × n term-by-document (sparse) matrix X .

Compute two reduced-dim. matrices W ,H so that X 'WH;W is m × r and H is r × n, with r � n.

Optimization problem:

minW ,H‖X −WH‖2F ,

subject to Wij ≥ 0 and Hij ≥ 0, ∀i , j .General approach: construct initial estimates for W and Hand then improve them via alternating iterations.

6 / 40

Page 7: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Minimization Challenges and Formulations[Berry et al., 2007]

Local Minima: Non-convexity of functionalf (W ,H) = 1

2‖X −WH‖2F in both W and H.

Non-unique Solutions: WDD−1H is nonnegative for anynonnegative (and invertible) D.

Many NNMF Formulations:Lee and Seung (2001) – information theoretic formulationbased on Kullback-Leibler divergence of X from WH.Guillamet, Bressan, and Vitria (2001) – diagonal weight matrixQ used (XQ ≈WHQ) to compensate for feature redundancy(columns of W ).Wang, Jiar, Hu, and Turk (2004) – constraint-basedformulation using Fisher linear discriminant analysis to improveextraction of spatially localized features.Other Cost Function Formulations – Hamza and Brady (2006),Dhillon and Sra (2005), Cichocki, Zdunek, and Amari (2006)

7 / 40

Page 8: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Multiplicative Method (MM)

Multiplicative update rules for W and H (Lee and Seung,1999):

1 Initialize W and H with nonnegative values, and scale thecolumns of W to unit norm.

2 Iterate for each c , j , and i until convergence or after kiterations:

1 Hcj ← Hcj(W TX )cj

(W TWH)cj + ε

2 Wic ←Wic(XHT )ic

(WHHT )ic + ε3 Scale the columns of W to unit norm.

Setting ε = 10−9 will suffice to avoid division by zero[Shahnaz et al., 2006].

8 / 40

Page 9: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Multiplicative Method (MM) contd.

Multiplicative Update MATLAB R©Code for NNMF

W = rand(m,k); % W initially random

H = rand(k,n); % H initially random

for i = 1 : maxiterH = H .* (WTA) ./ (WTWH + ε);W = W .* (AHT) ./ (WHHT + ε);

end

9 / 40

Page 10: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Lee and Seung MM Convergence

Convergence: when the MM algorithm converges to a limitpoint in the interior of the feasible region, the point is astationary point. The stationary point may or may not be alocal minimum. If the limit point lies on the boundary of thefeasible region, one cannot determine its stationarity[Berry et al., 2007].

Several modifications have been proposed: Gonzalez andZhang (2005) accelerated convergence somewhat butstationarity issue remains; Lin (2005) modified the algorithmto guarantee convergence to a stationary point; Dhillon andSra (2005) derived update rules that incorporate weights forthe importance of certain features of the approximation.

10 / 40

Page 11: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Hoyer’s Method

From neural network applications, Hoyer (2002) enforcedstatistical sparsity for the weight matrix H in order to enhancethe parts-based data representations in the matrix W .

Mu et al. (2003) suggested a regularization approach toachieve statistical sparsity in the matrix H: point countregularization; penalize the number of nonzeros in H ratherthan

∑ij Hij .

Goal of increased sparsity (or smoothness) – betterrepresentation of parts or features spanned by the corpus (X )[Berry and Browne, 2005].

11 / 40

Page 12: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

GD-CLS – Hybrid Approach

First use MM to compute an approximation to W for eachiteration – a gradient descent (GD) optimization step.

Then, compute the weight matrix H using a constrained leastsquares (CLS) model to penalize non-smoothness (i.e.,non-sparsity) in H – common Tikohonov regularizationtechnique used in image processing (Prasad et al., 2003).

Convergence to a non-stationary point evidenced (proof stillneeded).

12 / 40

Page 13: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

GD-CLS Algorithm

1 Initialize W and H with nonnegative values, and scale thecolumns of W to unit norm.

2 Iterate until convergence or after k iterations:

1 Wic ←Wic(XHT )ic

(WHHT )ic + ε, for c and i

2 Rescale the columns of W to unit norm.3 Solve the constrained least squares problem:

minHj

{‖Xj −WHj‖22 + λ‖Hj‖22},

where the subscript j denotes the j th column, for j = 1, . . . ,m.

Any negative values in Hj are set to zero. The parameter λ isa regularization value that is used to balance the reduction ofthe metric ‖Xj −WHj‖22 with enforcement of smoothness andsparsity in H.

13 / 40

Page 14: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Two Penalty Term Formulation

Introduce smoothing on Wk (feature vectors) in addition toHk :

minW ,H{‖X −WH‖2F + α‖W ‖2F + β‖H‖2F},

where ‖ · ‖F is the Frobenius norm.

Constrained NNMF (CNMF) iteration:

Hcj ← Hcj(W TX )cj − βHcj

(W TWH)cj + ε

Wic ←Wic(XHT )ic − αWic

(WHHT )ic + ε

14 / 40

Page 15: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Improving Feature Interpretability

Gauging Parameters for Constrained Optimization

How sparse (or smooth) should factors (W ,H) be to produce asmany interpretable features as possible?

To what extent do different norms (l1, l2, l∞) improve/degradatefeature quality or span? At what cost?

Can a nonnegative feature space be built from objects in bothimages and text? Are there opportunities for multimodal documentsimilarity?

15 / 40

Page 16: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Anomaly Detection (ASRS)

Classify events described by documents from the AirlineSafety Reporting System (ASRS) into 22 anomaly categories;contest from SDM07 Text Mining Workshop.

General Text Parsing (GTP) Software Environment in C++[Giles et al., 2003] used to parse both ASRS training set and acombined ASRS training and test set:

Dataset Terms ASRS DocumentsTraining 15,722 21,519

Training+Test 17,994 28,596 (7,077)

Global and document frequency of required to be at least 2;stoplist of 493 common words used; char length of any term∈ [2, 200].

Download Information:GTP: http://www.cs.utk.edu/∼lsi

ASRS: http://www.cs.utk.edu/tmw07

16 / 40

Page 17: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Initialization Schematic

1 n

H

Feat

ures

Documents

H

(Filtered)

1 nDocuments

Feat

ures

R T

k1 1

k

17 / 40

Page 18: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Anomaly to Feature Mapping and Scoring Schematic

kk

11Fe

atur

es

Feat

ures

TT 1

Extract Anomalies Per Feature

R

H H

Documents in RAnomaly 1

1 n

22

18 / 40

Page 19: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Training/Testing Performance (ROC Curves)

Best/Worst ROC curves (False Positive Rate versus TruePositive Rate)

ROC AreaAnomaly Type (Description) Training Contest

22 Security Concern/Threat .9040 .89255 Incursion (collision hazard) .8977 .87164 Excursion (loss of control) .8296 .7159

21 Illness/Injury Event .8201 .817212 Traffic Proximity Event .7954 .77517 Altitude Deviation .7931 .8085

18 Aircraft Damage/Encounter .7250 .726111 Terrain Proximity Event .7234 .75759 Speed Deviation .7060 .6893

10 Uncommanded (loss of control) .6784 .650413 Weather Issue .6287 .60182 Noncompliance (policy/proc.) .6009 .5551

19 / 40

Page 20: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Anomaly Summarization Prototype - Sentence Ranking

Sentence rank = f(global term weights) – B. Lamb

20 / 40

Page 21: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Improving Summarization and Steering

What versus why:

Extraction of textual concepts still requires human interpretation(in the absence of ontologies or domain-specific classifications).

How can previous knowledge or experience be captured for featurematching (or pruning)?

To what extent can feature vectors be annotated for future use oras the text collection is updated? What is the cost for updatingthe NNMF (or similar) model?

21 / 40

Page 22: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Unresolved Modeling Issues

Parameters and dimensionality:

Further work needed in determining effects of alternative termweighting schemes (for X ) and choices of control parameters(e.g., α, β for CNMF).

How does document (or object) clustering change with differentranks (or features)?

How should feature vectors from competing models (Bayesian,neural nets, etc.) be compared in both interpretability andcomputational cost?

22 / 40

Page 23: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Email Collection

By-product of the FERC investigation of Enron (originallycontained 15 million email messages).

This study used the improved corpus known as the EnronEmail set, which was edited by Dr. William Cohen at CMU.

This set had over 500,000 email messages. The majority weresent in the 1999 to 2001 timeframe.

23 / 40

Page 24: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Enron Historical 1999-2001

Ongoing, problematic, development of the Dabhol PowerCompany (DPC) in the Indian state of Maharashtra.

Deregulation of the Calif. energy industry, which led to rollingelectricity blackouts in the summer of 2000 (and subsequentinvestigations).

Revelation of Enron’s deceptive business and accountingpractices that led to an abrupt collapse of the energy colossusin October, 2001; Enron filed for bankruptcy in December,2001.

24 / 40

Page 25: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Multidimensional Data Analysis via PARAFACNew Paradigm:Multidimensional Data Analysis

+ + ...Third dimension offers more explanatory power: uncovers new

latent information and reveals subtle relationships

Build a 3-way array such that there is a term-author matrix for each month.

PARAFAC

Multilinearalgebra

term-authormatrix

term-author-montharray

Email graph

+ + ...

NonnegativePARAFAC

25 / 40

Page 26: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Temporal Assessment via PARAFAC

Objective

Use PARAFAC to analyze content of email communications over time

David

Ellen

Bob

Frank

Alice Carl

IngridHenk

Gary

term-author-montharray

+ + ...=

26 / 40

Page 27: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Mathematical Notation

Kronecker product

A⊗ B =

A11B · · · A1nB...

. . ....

Am1B · · · AmnB

Khatri-Rao product (columnwise Kronecker)

A� B =(A1 ⊗ B1 · · · An ⊗ Bn

)Outer product

A1 ◦ B1 =

A11B11 · · · A11Bm1...

. . ....

Am1B11 · · · Am1Bm1

27 / 40

Page 28: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

PARAFAC Representations

PARAllel FACtors (Harshman, 1970)

Also known as CANDECOMP (Carroll & Chang, 1970)

Typically solved by Alternating Least Squares (ALS)

Alternative PARAFAC formulations

Xijk ≈r∑

i=1

AirBjrCkr

X ≈r∑

i=1

Ai ◦ Bi ◦ Ci , where X is a 3-way array (tensor).

Xk ≈ A diag(Ck:) BT , where Xk is a tensor slice.

X I×JK ≈ A(C � B)T , where X is matricized.

28 / 40

Page 29: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

PARAFAC (Visual) Representations

+ + ...

Scalar form Outer product form

Tensor slice form Matrix form

=

29 / 40

Page 30: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Nonnegative PARAFAC Algorithm

Adapted from (Mørup, 2005) and based on NNMF by (Leeand Seung, 2001)

||X I×JK − A(C � B)T ||F = ||X J×IK − B(C � A)T ||F= ||XK×IJ − C (B � A)T ||F

Minimize over A, B, C using multiplicative update rule:

Aiρ ← Aiρ(X I×JKZ )iρ

(AZTZ )iρ + ε, Z = (C � B)

Bjρ ← Bjρ(X J×IKZ )jρ

(BZTZ )jρ + ε, Z = (C � A)

Ckρ ← Ckρ(XK×IJZ )kρ

(CZTZ )kρ + ε, Z = (B � A)

30 / 40

Page 31: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Tensor-Generated Group Discussions

NNTF Group Discussions in 2001197 authors; 8 distinguishable discussions“Kaminski/Education” topic previously unseen

J F M A M J J A S O N D0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Month

Nor

mal

ized

sca

le

Group 3

Group 4

Group 10

Group 12

Group 17

Group 20

Group 21

Group 25

CA/Energy

India

FastowCompanies

Downfall

CollegeFootball

Kaminski/Education

Downfall newsfeeds

31 / 40

Page 32: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Gantt Charts from PARAFAC ModelsNNTF/PARAFAC PARAFAC

Group Jan Feb Mar Apr May June July Aug Sept Oct Nov Dec

Key

Interval# Bars

[.01,0.1) [0.1, 0.3) [0.3, 0.5) [0.5, 0.7) [0.7,1.0)

(Gray = unclassified topic)

24

25

Downfall Newsfeeds

California Energy

College Football

Fastow Companies

18

15

16

17

23

12

13

14

19

20

21

22

India

Education (Kaminski)

11

5

6

7

8

9

10

Downfall

India

1

2

3

4

Group Jan Feb Mar Apr May June July Aug Sept Oct Nov Dec

College Football

India

College Football

Key

# Bars[.01,0.1) [0.1, 0.3)Interval [-1.0,.01) [0.3, 0.5) [0.5, 0.7) [0.7,1.0)

(Gray = unclassified topic)

24

25

Downfall

18

15

16

17

23

12

13

22

Education (Kaminski)

11

14

19

20

21

9

10

5

6

7

8

California

1

2

3

4

32 / 40

Page 33: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Day-level Analysis for PARAFAC (Three Groups)

Rank-25 tensor for 357 out of 365 days of 2001:A (69, 157× 25), B (197× 25), C (357× 25)Groups 3,4,5:

33 / 40

Page 34: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Day-level Analysis for NN-PARAFAC (Three Groups)

Rank-25 tensor (best minimizer) for 357 out of 365 daysof 2001: A (69, 157× 25), B (197× 25), C (357× 25)Groups 1,7,8:

34 / 40

Page 35: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Day-level Analysis for NN-PARAFAC (Two Groups)

Groups 20 (California Energy) and 9 (Football) (from C factorof best minimizer) in day-level analysis of 2001:

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

0.20.40.60.8

1Weekly pro & college football

betting pool at EnronDiscussions about energyprojects in California

Month

Conv

ersa

tion

Leve

l

= ...+ +

Term

s

Authors

Days

Term

s

Authors

Days

Term-author-dayarray

Conversation #1 Conversation #2

35 / 40

Page 36: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Four-way Tensor Results (Sept. 2007)

Apply NN-PARAFAC to term-author-recipient-day array(39, 573× 197× 197× 357); construct a rank-25 tensor(best minimizer among 10 runs).

Goal: track more focused discussions between individuals/small groups; for example, betting pool (football).

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

0.2

0.4

0.6

0.8

1

Weekly pro & college footballbetting pool at Enron

Con

vers

atio

n Le

vel

3−way Results

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec0

0.5

1

From the list of 197 recipients3 individuals appeared one week

Month

Con

vers

atio

n Le

vel

4−way Results

36 / 40

Page 37: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Four-way Tensor Results (Sept. 2007)

Four-way tensor may track subconversation already found bythree-way tensor; for example, RTO (Regional TransmissionOrganization) discussions.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

0.2

0.4

0.6

0.8 Conversation about FERC and RegionalTransmission Organizations (RTOs)

Con

vers

atio

n Le

vel 3−way Results

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec0

0.5

1

Subconversation between J. Steffes and3 other VPs on the same topic

Month

Con

vers

atio

n Le

vel 4−way Results

37 / 40

Page 38: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

NNTF Optimal Rank?

No known algorithm for computing the rank of a k-way arrayfor k ≥ 3 [Kruskal, 1989].

The maximum rank is not a closed set for a given randomtensor.

The maximum rank of a m × n × k tensor is unknown; oneweak inequality is given by

max{m, n, k} ≤ rank ≤ min{m × n,m × k, n × k}

For our rank-25 NNTF, the size of the relative residual normsuggests we are still far from the maximum rank of the 3-wayand 4-way arrays.

38 / 40

Page 39: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Further Reading

I M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons.Alg. and Applic. for Approx. Nonnegative Matrix Factorization.Comput. Stat. & Data Anal. 52(1):155-173, 2007.

I F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons.Document Clustering Using Nonnegative Matrix Factorization.Info. Proc. & Management 42(2):373-386, 2006.

I M.W. Berry and M. Browne.Email Surveillance Using Nonnegative Matrix Factorization.Comp. & Math. Org. Theory 11:249-264, 2005.

I P. Hoyer.Nonnegative Matrix Factorization with Sparseness Constraints.J. Machine Learning Research 5:1457-1469, 2004.

39 / 40

Page 40: Automating the Detection of Anomalies and Trends from Text ...hillol/NGDM07/abstracts/slides/Berry_n… · 7 6 5 4 3 2 1 ventricular aortic septal left defect regurgitation ventricle

Further Reading (contd.)

I J.T. Giles and L. Wo and M.W. Berry.GTP (General Text Parser) Software for Text Mining.Software for Text Mining, in Statistical Data Mining andKnowledge Discovery. CRC Press, Boca Raton, FL, 2003, pp.455-471.

I W. Xu, X. Liu, and Y. Gong.Document-Clustering based on Nonneg. Matrix Factorization.Proceedings of SIGIR’03, Toronto, CA, 2003, pp. 267-273.

I J.B. Kruskal.Rank, Decomposition, and Uniqueness for 3-way and n-wayArrays.In Multiway Data Analysis, Eds. R. Coppi and S. Bolaso,Elsevier 1989, pp. 7-18.

40 / 40