TRANSCRIPT
Automating the Detection of Anomalies and Trends from Text — NGDM'07 Workshop
Baltimore, MD
Michael W. Berry
Department of Electrical Engineering & Computer Science, University of Tennessee
October 11, 2007
1 Nonnegative Matrix Factorization (NNMF)
  Motivation
  Underlying Optimization Problem
  MM Method (Lee and Seung)
  Smoothing and Sparsity Constraints
  Hybrid NNMF Approach
2 Anomaly Detection in ASRS Collection
  Document Parsing and Term Weighting
  Preliminary Training
  SDM07 Contest Performance
3 NNTF Classification of Enron Email
  Corpus and Historical Events
  Discussion Tracking via PARAFAC/Tensor Factorization
  Multidimensional Data Analysis
  PARAFAC Model
4 References
NNMF Origins
NNMF (Nonnegative Matrix Factorization) can be used to approximate high-dimensional data having nonnegative components.

Lee and Seung (1999) demonstrated its use as a sum-by-parts representation of image data in order to both identify and classify image features.

Xu et al. (2003) demonstrated how NNMF-based indexing could outperform SVD-based Latent Semantic Indexing (LSI) for some information retrieval tasks.
NNMF for Image Processing
[Figure: sparse NNMF bases (Ai ≈ WHi) versus dense SVD bases (Ai ≈ UΣVi) for image data; Lee and Seung (1999)]
NNMF Analogue for Text Mining (Medlars)
[Figure: highest weighted terms in four NNMF basis vectors for the Medlars collection —
W*1: ventricular, aortic, septal, left, defect, regurgitation, ventricle, valve, cardiac, pressure;
W*2: oxygen, flow, pressure, blood, cerebral, hypothermia, fluid, venous, arterial, perfusion;
W*5: children, child, autistic, speech, group, early, visual, anxiety, emotional, autism;
W*6: kidney, marrow, dna, cells, nephrectomy, unilateral, lymphocytes, bone, thymidine, rats.]

Interpretable NNMF feature vectors; Langville et al. (2006)
Derivation
Given an m × n term-by-document (sparse) matrix X.

Compute two reduced-dimension matrices W, H so that X ≈ WH; W is m × r and H is r × n, with r ≪ n.

Optimization problem:

  min_{W,H} ‖X − WH‖²F,

subject to Wij ≥ 0 and Hij ≥ 0, ∀ i, j.

General approach: construct initial estimates for W and H and then improve them via alternating iterations.
Minimization Challenges and Formulations [Berry et al., 2007]
Local Minima: Non-convexity of the functional f(W, H) = (1/2)‖X − WH‖²F in both W and H.
Non-unique Solutions: (WD)(D⁻¹H) = WH is an equally valid nonnegative factorization for any nonnegative (and invertible) D.
Many NNMF Formulations:

Lee and Seung (2001) – information-theoretic formulation based on the Kullback-Leibler divergence of X from WH.

Guillamet, Bressan, and Vitria (2001) – diagonal weight matrix Q used (XQ ≈ WHQ) to compensate for feature redundancy (columns of W).

Wang, Jia, Hu, and Turk (2004) – constraint-based formulation using Fisher linear discriminant analysis to improve extraction of spatially localized features.

Other cost-function formulations – Hamza and Brady (2006), Dhillon and Sra (2005), Cichocki, Zdunek, and Amari (2006).
Multiplicative Method (MM)
Multiplicative update rules for W and H (Lee and Seung, 1999):

1 Initialize W and H with nonnegative values, and scale the columns of W to unit norm.
2 Iterate for each c, j, and i until convergence or after k iterations:
  1 Hcj ← Hcj (WᵀX)cj / ((WᵀWH)cj + ε)
  2 Wic ← Wic (XHᵀ)ic / ((WHHᵀ)ic + ε)
  3 Scale the columns of W to unit norm.

Setting ε = 10⁻⁹ will suffice to avoid division by zero [Shahnaz et al., 2006].
Multiplicative Method (MM) contd.
Multiplicative Update MATLAB® Code for NNMF

W = rand(m,k); % W initially random
H = rand(k,n); % H initially random
for i = 1:maxiter
    H = H .* (W'*A) ./ (W'*W*H + epsilon); % update H
    W = W .* (A*H') ./ (W*H*H' + epsilon); % update W
end
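The MATLAB loop above translates directly to NumPy; a minimal illustrative sketch (the test matrix A, the rank k, and the iteration count here are arbitrary choices, not from the slides):

```python
import numpy as np

def nnmf_mm(A, k, maxiter=200, eps=1e-9, seed=0):
    """Multiplicative-update NNMF (Lee and Seung): A (m x n) ~ W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # W initially random, nonnegative
    H = rng.random((k, n))  # H initially random, nonnegative
    for _ in range(maxiter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # update H
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # update W
    return W, H

# Usage: factor a small random nonnegative matrix and measure the relative fit.
A = np.random.default_rng(1).random((20, 30))
W, H = nnmf_mm(A, k=5)
err = np.linalg.norm(A - W @ H, 'fro') / np.linalg.norm(A, 'fro')
```

Because both updates multiply by nonnegative ratios, W and H stay nonnegative throughout, which is the point of the MM scheme.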
Lee and Seung MM Convergence
Convergence: when the MM algorithm converges to a limit point in the interior of the feasible region, the point is a stationary point. The stationary point may or may not be a local minimum. If the limit point lies on the boundary of the feasible region, one cannot determine its stationarity [Berry et al., 2007].

Several modifications have been proposed: Gonzalez and Zhang (2005) accelerated convergence somewhat, but the stationarity issue remains; Lin (2005) modified the algorithm to guarantee convergence to a stationary point; Dhillon and Sra (2005) derived update rules that incorporate weights for the importance of certain features of the approximation.
Hoyer’s Method
From neural network applications, Hoyer (2002) enforced statistical sparsity for the weight matrix H in order to enhance the parts-based data representations in the matrix W.

Mu et al. (2003) suggested a regularization approach to achieve statistical sparsity in the matrix H: point count regularization; penalize the number of nonzeros in H rather than Σij Hij.

Goal of increased sparsity (or smoothness) – a better representation of the parts or features spanned by the corpus (X) [Berry and Browne, 2005].
GD-CLS – Hybrid Approach
First use MM to compute an approximation to W for each iteration – a gradient descent (GD) optimization step.

Then compute the weight matrix H using a constrained least squares (CLS) model to penalize non-smoothness (i.e., non-sparsity) in H – a common Tikhonov regularization technique used in image processing (Prasad et al., 2003).

Convergence to a non-stationary point has been observed empirically (a proof is still needed).
GD-CLS Algorithm
1 Initialize W and H with nonnegative values, and scale the columns of W to unit norm.
2 Iterate until convergence or after k iterations:
  1 Wic ← Wic (XHᵀ)ic / ((WHHᵀ)ic + ε), for each c and i
  2 Rescale the columns of W to unit norm.
  3 Solve the constrained least squares problem

      min_{Hj} { ‖Xj − WHj‖²₂ + λ‖Hj‖²₂ },

    where the subscript j denotes the j-th column, for j = 1, …, n.

Any negative values in Hj are set to zero. The parameter λ is a regularization value that balances the reduction of the metric ‖Xj − WHj‖²₂ against the enforcement of smoothness and sparsity in H.
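The CLS step can be sketched in NumPy: every column Hj shares the same coefficient matrix W, so all columns are obtained from one solve of the regularized normal equations (WᵀW + λI)H = WᵀX, after which negative entries are set to zero. An illustrative sketch (the dimensions and λ are arbitrary):

```python
import numpy as np

def cls_step(X, W, lam):
    """CLS step of GD-CLS: min_H ||X - W H||^2 + lam*||H||^2, then clamp H >= 0.
    One solve of the regularized normal equations handles all columns at once."""
    k = W.shape[1]
    H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ X)
    return np.maximum(H, 0.0)  # any negative values in H_j are set to zero

rng = np.random.default_rng(0)
X = rng.random((40, 60))  # m x n term-by-document matrix
W = rng.random((40, 8))   # current m x k factor from the GD step
H = cls_step(X, W, lam=0.1)
```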
Two Penalty Term Formulation
Introduce smoothing on Wk (feature vectors) in addition to Hk:

  min_{W,H} { ‖X − WH‖²F + α‖W‖²F + β‖H‖²F },

where ‖·‖F is the Frobenius norm.

Constrained NNMF (CNMF) iteration:

  Hcj ← Hcj ((WᵀX)cj − βHcj) / ((WᵀWH)cj + ε)

  Wic ← Wic ((XHᵀ)ic − αWic) / ((WHHᵀ)ic + ε)
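A NumPy sketch of the CNMF sweep (illustrative: α, β, and the dimensions are arbitrary, and clamping negative numerators to zero is an added safeguard not stated on the slide):

```python
import numpy as np

def cnmf_update(X, W, H, alpha, beta, eps=1e-9):
    """One sweep of the CNMF multiplicative updates with Frobenius-norm
    penalties alpha*||W||_F^2 and beta*||H||_F^2. Negative numerators are
    clamped at zero to keep the factors nonnegative (an added safeguard)."""
    H = H * np.maximum(W.T @ X - beta * H, 0.0) / (W.T @ W @ H + eps)
    W = W * np.maximum(X @ H.T - alpha * W, 0.0) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(0)
X = rng.random((30, 50))
W, H = rng.random((30, 6)), rng.random((6, 50))
for _ in range(100):
    W, H = cnmf_update(X, W, H, alpha=0.01, beta=0.01)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Setting α = β = 0 recovers the plain MM updates; larger values trade fit for smoother, sparser factors.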
Improving Feature Interpretability
Gauging Parameters for Constrained Optimization
How sparse (or smooth) should the factors (W, H) be to produce as many interpretable features as possible?

To what extent do different norms (l1, l2, l∞) improve/degrade feature quality or span? At what cost?

Can a nonnegative feature space be built from objects in both images and text? Are there opportunities for multimodal document similarity?
Anomaly Detection (ASRS)
Classify events described by documents from the Aviation Safety Reporting System (ASRS) into 22 anomaly categories; contest from the SDM07 Text Mining Workshop.

General Text Parsing (GTP) Software Environment in C++ [Giles et al., 2003] used to parse both the ASRS training set and a combined ASRS training and test set:

  Dataset        Terms   ASRS Documents
  Training       15,722  21,519
  Training+Test  17,994  28,596 (7,077)

Global and document frequency required to be at least 2; stoplist of 493 common words used; character length of any term ∈ [2, 200].

Download information:
  GTP: http://www.cs.utk.edu/~lsi
  ASRS: http://www.cs.utk.edu/tmw07
Initialization Schematic
[Schematic: features × documents matrix H (documents 1 … n) and its filtered counterpart, with rank-k factors R and T.]
Anomaly to Feature Mapping and Scoring Schematic
[Schematic: anomaly-to-feature mapping — anomalies are extracted per feature using the k × k factors R and T and the feature × document matrices H; documents in R are associated with each anomaly.]
Training/Testing Performance (ROC Curves)
Best/worst ROC curves (False Positive Rate versus True Positive Rate):

                                         ROC Area
  Anomaly Type (Description)             Training  Contest
  22 Security Concern/Threat             .9040     .8925
  5  Incursion (collision hazard)        .8977     .8716
  4  Excursion (loss of control)         .8296     .7159
  21 Illness/Injury Event                .8201     .8172
  12 Traffic Proximity Event             .7954     .7751
  7  Altitude Deviation                  .7931     .8085
  18 Aircraft Damage/Encounter           .7250     .7261
  11 Terrain Proximity Event             .7234     .7575
  9  Speed Deviation                     .7060     .6893
  10 Uncommanded (loss of control)       .6784     .6504
  13 Weather Issue                       .6287     .6018
  2  Noncompliance (policy/proc.)        .6009     .5551
Anomaly Summarization Prototype - Sentence Ranking
Sentence rank = f(global term weights) – B. Lamb
Improving Summarization and Steering
What versus why:
Extraction of textual concepts still requires human interpretation (in the absence of ontologies or domain-specific classifications).

How can previous knowledge or experience be captured for feature matching (or pruning)?

To what extent can feature vectors be annotated for future use or as the text collection is updated? What is the cost of updating the NNMF (or similar) model?
Unresolved Modeling Issues
Parameters and dimensionality:
Further work is needed to determine the effects of alternative term-weighting schemes (for X) and choices of control parameters (e.g., α, β for CNMF).

How does document (or object) clustering change with different ranks (or features)?

How should feature vectors from competing models (Bayesian, neural nets, etc.) be compared in both interpretability and computational cost?
Email Collection
By-product of the FERC investigation of Enron (originally contained 15 million email messages).

This study used the improved corpus known as the Enron Email set, which was edited by Dr. William Cohen at CMU.

This set had over 500,000 email messages. The majority were sent in the 1999 to 2001 timeframe.
Enron Historical 1999-2001
Ongoing, problematic development of the Dabhol Power Company (DPC) in the Indian state of Maharashtra.

Deregulation of the California energy industry, which led to rolling electricity blackouts in the summer of 2000 (and subsequent investigations).

Revelation of Enron's deceptive business and accounting practices, which led to an abrupt collapse of the energy colossus in October 2001; Enron filed for bankruptcy in December 2001.
Multidimensional Data Analysis via PARAFAC

New Paradigm: Multidimensional Data Analysis

The third dimension offers more explanatory power: it uncovers new latent information and reveals subtle relationships.

Build a 3-way array such that there is a term-author matrix for each month.

[Diagram: an email graph is converted (via multilinear algebra) into a term-author matrix per month, stacked into a term-author-month array, and decomposed by nonnegative PARAFAC into a sum of rank-1 components.]
Temporal Assessment via PARAFAC
Objective
Use PARAFAC to analyze content of email communications over time
[Diagram: an email graph among authors (Alice, Bob, Carl, David, Ellen, Frank, Gary, Henk, Ingrid) is stacked into a term-author-month array and decomposed as a sum of rank-1 components.]
Mathematical Notation
Kronecker product:

  A ⊗ B = [ A11B ⋯ A1nB
            ⋮    ⋱   ⋮
            Am1B ⋯ AmnB ]

Khatri-Rao product (columnwise Kronecker):

  A ⊙ B = ( A1 ⊗ B1  ⋯  An ⊗ Bn )

Outer product:

  A1 ∘ B1 = [ A11B11 ⋯ A11Bm1
              ⋮      ⋱   ⋮
              Am1B11 ⋯ Am1Bm1 ]
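All three products are easy to check numerically; a small NumPy illustration (the matrices are arbitrary toy values). Note that the Khatri-Rao product simply selects the "matching-column" blocks of the full Kronecker product:

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)       # 3 x 2
B = np.arange(1.0, 7.0).reshape(3, 2)  # 3 x 2

# Kronecker product: every entry A_ij scales a full copy of B -> (9 x 4)
kron = np.kron(A, B)

# Khatri-Rao product: columnwise Kronecker of matrices with equal column counts -> (9 x 2)
khatri_rao = np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])])

# Outer product of the first columns -> rank-1 (3 x 3) matrix
outer = np.outer(A[:, 0], B[:, 0])
```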
PARAFAC Representations
PARAllel FACtors (Harshman, 1970)
Also known as CANDECOMP (Carroll & Chang, 1970)
Typically solved by Alternating Least Squares (ALS)
Alternative PARAFAC formulations:

  Xijk ≈ Σρ=1..r Aiρ Bjρ Ckρ   (scalar form)

  X ≈ Σρ=1..r Aρ ∘ Bρ ∘ Cρ, where X is a 3-way array (tensor)   (outer product form)

  Xk ≈ A diag(Ck:) Bᵀ, where Xk is a tensor slice   (tensor slice form)

  X^(I×JK) ≈ A(C ⊙ B)ᵀ, where X is matricized   (matrix form)
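The equivalence of these formulations can be verified numerically; an illustrative NumPy check of the matrix and tensor-slice forms against the outer-product form (random toy factors, and an assumed unfolding convention in which the columns of the I × JK matricization are ordered by (k, j) pairs to match C ⊙ B):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, r = 4, 5, 6, 3
A, B, C = rng.random((I, r)), rng.random((J, r)), rng.random((K, r))

# Outer-product form: X = sum_rho A_rho o B_rho o C_rho (an I x J x K tensor)
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Matrix form: the I x JK unfolding equals A (C kr B)^T, with unfolded
# columns ordered by (k, j) pairs to match the Khatri-Rao product C kr B
CkB = np.column_stack([np.kron(C[:, p], B[:, p]) for p in range(r)])
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
matches_matrix_form = np.allclose(X1, A @ CkB.T)

# Tensor-slice form: X_k = A diag(C_k:) B^T for each frontal slice k
matches_slice_form = np.allclose(X[:, :, 0], A @ np.diag(C[0]) @ B.T)
```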
PARAFAC (Visual) Representations
[Diagram: visual depictions of the PARAFAC model in scalar, outer product, tensor slice, and matrix forms.]
Nonnegative PARAFAC Algorithm
Adapted from (Mørup, 2005) and based on NNMF by (Lee and Seung, 2001):

  ‖X^(I×JK) − A(C ⊙ B)ᵀ‖F = ‖X^(J×IK) − B(C ⊙ A)ᵀ‖F = ‖X^(K×IJ) − C(B ⊙ A)ᵀ‖F

Minimize over A, B, C using multiplicative update rules:

  Aiρ ← Aiρ (X^(I×JK) Z)iρ / ((A ZᵀZ)iρ + ε),  Z = (C ⊙ B)

  Bjρ ← Bjρ (X^(J×IK) Z)jρ / ((B ZᵀZ)jρ + ε),  Z = (C ⊙ A)

  Ckρ ← Ckρ (X^(K×IJ) Z)kρ / ((C ZᵀZ)kρ + ε),  Z = (B ⊙ A)
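The update rules above can be sketched in NumPy (illustrative; the unfolding conventions and the toy rank-2 test tensor are assumptions of this sketch, not from the slides):

```python
import numpy as np

def khatri_rao(P, Q):
    """Columnwise Kronecker product of two matrices with equal column counts."""
    return np.column_stack([np.kron(P[:, j], Q[:, j]) for j in range(P.shape[1])])

def nn_parafac(X, r, maxiter=500, eps=1e-9, seed=0):
    """Nonnegative PARAFAC via multiplicative updates (after Morup, 2005)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = rng.random((I, r)), rng.random((J, r)), rng.random((K, r))
    # Mode-wise unfoldings with column orders matching the Khatri-Rao factors below.
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)  # columns ordered by (k, j)
    X2 = X.transpose(1, 2, 0).reshape(J, K * I)  # columns ordered by (k, i)
    X3 = X.transpose(2, 1, 0).reshape(K, J * I)  # columns ordered by (j, i)
    for _ in range(maxiter):
        Z = khatri_rao(C, B); A *= (X1 @ Z) / (A @ (Z.T @ Z) + eps)
        Z = khatri_rao(C, A); B *= (X2 @ Z) / (B @ (Z.T @ Z) + eps)
        Z = khatri_rao(B, A); C *= (X3 @ Z) / (C @ (Z.T @ Z) + eps)
    return A, B, C

# Usage: fit a small exact rank-2 nonnegative tensor.
rng = np.random.default_rng(1)
At, Bt, Ct = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
X = np.einsum('ir,jr,kr->ijk', At, Bt, Ct)
A, B, C = nn_parafac(X, r=2)
rel = np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X)
```

As with NNMF, the multiplicative ratios keep all three factors nonnegative at every iteration.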
Tensor-Generated Group Discussions
NNTF Group Discussions in 2001: 197 authors; 8 distinguishable discussions; the "Kaminski/Education" topic was previously unseen.

[Plot: normalized conversation level by month (Jan-Dec 2001) for Groups 3, 4, 10, 12, 17, 20, 21, and 25, with topic labels CA/Energy, India, Fastow Companies, Downfall, College Football, Kaminski/Education, and Downfall newsfeeds.]
Gantt Charts from PARAFAC Models

[Gantt charts: month-by-month (Jan-Dec 2001) activity of 25 groups under the NNTF/PARAFAC and PARAFAC models. Bar shading encodes conversation-level intervals [.01, 0.1), [0.1, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 1.0) (the PARAFAC chart adds [-1.0, .01)); gray = unclassified topic. Labeled topics include Downfall Newsfeeds, California Energy, College Football, Fastow Companies, India, Education (Kaminski), and Downfall.]
Day-level Analysis for PARAFAC (Three Groups)
Rank-25 tensor for 357 of the 365 days of 2001: A (69,157 × 25), B (197 × 25), C (357 × 25). Groups 3, 4, 5: [plots omitted]
Day-level Analysis for NN-PARAFAC (Three Groups)
Rank-25 tensor (best minimizer) for 357 of the 365 days of 2001: A (69,157 × 25), B (197 × 25), C (357 × 25). Groups 1, 7, 8: [plots omitted]
Day-level Analysis for NN-PARAFAC (Two Groups)
Groups 20 (California Energy) and 9 (Football) (from the C factor of the best minimizer) in a day-level analysis of 2001:

[Plot: conversation level by month (Jan-Dec 2001) for two conversations — the weekly pro & college football betting pool at Enron, and discussions about energy projects in California. Diagram: the term-author-day array decomposes into Conversation #1 + Conversation #2 + ….]
Four-way Tensor Results (Sept. 2007)
Apply NN-PARAFAC to a term-author-recipient-day array (39,573 × 197 × 197 × 357); construct a rank-25 tensor (best minimizer among 10 runs).

Goal: track more focused discussions between individuals/small groups; for example, the betting pool (football).

[Plots: conversation level by month (Jan-Dec 2001). 3-way results: the weekly pro & college football betting pool at Enron. 4-way results: from the list of 197 recipients, 3 individuals appeared one week.]
Four-way Tensor Results (Sept. 2007)
A four-way tensor may track a subconversation already found by the three-way tensor; for example, RTO (Regional Transmission Organization) discussions.

[Plots: conversation level by month (Jan-Dec 2001). 3-way results: conversation about FERC and Regional Transmission Organizations (RTOs). 4-way results: subconversation between J. Steffes and 3 other VPs on the same topic.]
NNTF Optimal Rank?
No known algorithm exists for computing the rank of a k-way array for k ≥ 3 [Kruskal, 1989].

The maximum rank is not a closed set for a given random tensor.

The maximum rank of an m × n × k tensor is unknown; one weak inequality is

  max{m, n, k} ≤ rank ≤ min{mn, mk, nk}

For our rank-25 NNTF, the size of the relative residual norm suggests we are still far from the maximum rank of the 3-way and 4-way arrays.
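The weak inequality is simple to evaluate; an illustrative helper (the 197 × 197 × 357 example shape is borrowed from the author/recipient/day modes of the Enron arrays above):

```python
def rank_bounds(m, n, k):
    """Weak bounds on the maximum rank of an m x n x k tensor:
    max{m, n, k} <= rank <= min{m*n, m*k, n*k}."""
    return max(m, n, k), min(m * n, m * k, n * k)

# Example: a 197 x 197 x 357 tensor has maximum rank between 357 and 38,809,
# so a rank-25 model sits far below even the lower bound.
lo, hi = rank_bounds(197, 197, 357)
```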
Further Reading
- M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons. Algorithms and Applications for Approximate Nonnegative Matrix Factorization. Computational Statistics & Data Analysis 52(1):155-173, 2007.
- F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons. Document Clustering Using Nonnegative Matrix Factorization. Information Processing & Management 42(2):373-386, 2006.
- M.W. Berry and M. Browne. Email Surveillance Using Nonnegative Matrix Factorization. Computational & Mathematical Organization Theory 11:249-264, 2005.
- P. Hoyer. Nonnegative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research 5:1457-1469, 2004.
Further Reading (contd.)
- J.T. Giles, L. Wo, and M.W. Berry. GTP (General Text Parser) Software for Text Mining. In Statistical Data Mining and Knowledge Discovery, CRC Press, Boca Raton, FL, 2003, pp. 455-471.
- W. Xu, X. Liu, and Y. Gong. Document Clustering Based on Nonnegative Matrix Factorization. Proceedings of SIGIR'03, Toronto, Canada, 2003, pp. 267-273.
- J.B. Kruskal. Rank, Decomposition, and Uniqueness for 3-way and N-way Arrays. In Multiway Data Analysis, Eds. R. Coppi and S. Bolasco, Elsevier, 1989, pp. 7-18.