Text Mining and Continuous Assurance - Kevin Moffitt - 12th CONTECSI / 34th WCARS
Text Mining and Continuous Assurance
Continuous Assurance
• Allows for the automated and frequent review of business data
• Current focus is on the structured data
– General ledgers
– Financial statements
– XBRL
• However, we cannot ignore the information found in unstructured data
– Textual data, for example the narrative portion of financial disclosures
• Up to 85% of the data in financial disclosures is in the form of text
Text Mining
• Many methods for extracting data from text
• One popular method is to use dictionaries/word lists
• E.g., a dictionary to identify positive language in business documents:
SATISFIES
PREEMINENT
REWARDED
BENEFITTING
SOLVING
COLLABORATIONS
BOOST
TREMENDOUS
GREATEST
PERFECTLY
DELIGHTING
COMPLIMENTING
EXCITING
REBOUNDED
CONCLUSIVE
ASSURE
INNOVATED
ENJOYING
CREATIVE
GREATLY
Drawbacks of Dictionary Method
• Single words
– Context Free
– Naïve
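As a rough illustration of the dictionary method and its context-free drawback, the following Python sketch counts how many tokens of a text appear in the positive word list from the slide. The function name `positive_word_hits` and the simple regular-expression tokenizer are assumptions made for this example, not part of the original study.

```python
import re

# The positive words as listed on the slide.
POSITIVE_WORDS = {
    "SATISFIES", "PREEMINENT", "REWARDED", "BENEFITTING", "SOLVING",
    "COLLABORATIONS", "BOOST", "TREMENDOUS", "GREATEST", "PERFECTLY",
    "DELIGHTING", "COMPLIMENTING", "EXCITING", "REBOUNDED", "CONCLUSIVE",
    "ASSURE", "INNOVATED", "ENJOYING", "CREATIVE", "GREATLY",
}

def positive_word_hits(text):
    """Return (hits, share): matched tokens and their share of all tokens."""
    # Hypothetical tokenizer: runs of letters and apostrophes, upper-cased.
    tokens = re.findall(r"[A-Z']+", text.upper())
    hits = [t for t in tokens if t in POSITIVE_WORDS]
    share = len(hits) / len(tokens) if tokens else 0.0
    return hits, share

hits, share = positive_word_hits(
    "Sales rebounded and customers are enjoying the creative redesign."
)
# Single-word matching ignores negation: "not exciting" still counts as a hit.
negated_hits, _ = positive_word_hits("The results were not exciting.")
```

The negated sentence still yields a hit for EXCITING, which is the kind of context-free behaviour the drawbacks slide points to.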
Lexical Bundles
• Frequent multi-word sequences in a given corpus (e.g. financial
reports, history journals, biology journals)
• More context in phrases than in individual words
• Criteria for identifying lexical bundles
– Sequences of four or more words
– Occur in at least 15% of unique documents
– Occur at a rate of at least 20 times per million words
Example Lexical Bundles from Annual Reports
the fair value of
be adversely affected by
as a percentage of
assets and liabilities and
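The identification criteria above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it only extracts fixed-length sequences (`n=4`, whereas the criteria allow longer ones), the tokenizer is hypothetical, and the names `tokenize` and `find_lexical_bundles` are invented for this example. The two thresholds default to the 15% document share and 20-per-million rate given on the slide.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lower-case runs of letters, digits, & and '.
    return re.findall(r"[a-z0-9&']+", text.lower())

def find_lexical_bundles(documents, n=4, min_doc_share=0.15, min_per_million=20.0):
    """Return {bundle: (rate_per_million, doc_share)} for n-word sequences."""
    occurrences = Counter()   # total occurrences of each n-gram
    doc_counts = Counter()    # number of documents containing each n-gram
    total_words = 0
    for text in documents:
        tokens = tokenize(text)
        total_words += len(tokens)
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        occurrences.update(grams)
        doc_counts.update(set(grams))  # count each document at most once
    bundles = {}
    for gram, count in occurrences.items():
        doc_share = doc_counts[gram] / len(documents)
        per_million = count / total_words * 1_000_000
        if doc_share >= min_doc_share and per_million >= min_per_million:
            bundles[gram] = (per_million, doc_share)
    return bundles

# Toy corpus; the 0.5 document threshold only makes this tiny example selective.
toy_docs = [
    "The fair value of the asset declined.",
    "We estimate the fair value of the debt.",
    "Costs rose as a percentage of sales.",
]
toy_bundles = find_lexical_bundles(toy_docs, n=4, min_doc_share=0.5)
```

In the toy corpus only the sequences shared by two of the three documents survive the document-share filter; on a real corpus the per-million threshold would do more of the work.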
Lexical Bundles
• Research objective: use lexical bundles to discriminate between fraudulent and non-fraudulent financial reports
Research Questions
• RQ1: What are the most frequently used lexical bundles in the Management’s Discussion and Analysis (MD&A) sections of fraudulent and non-fraudulent annual reports?
• RQ2: Which lexical bundles are used at a considerably different rate in
fraudulent and non-fraudulent MD&As?
• RQ3: Can lexical bundles be used to classify fraudulent and non-fraudulent
MD&As at a rate greater than chance?
Sample Selection
• Identified 101 fraudulent annual reports (10-Ks) from a set of SEC investigations
• Analyzed the Management’s Discussion and Analysis (MD&A) section of each 10-K
– Gives investors a view of the company from management’s perspective
– Contains some of the least structured language in the 10-K
– Most-read part of the 10-K
Sample Selection
Sample selection criteria for fraudulent 10-Ks
Criterion                                                                Count
Companies identified as fraudulent by searching through AAERs            141
Disqualified because fraud did not involve 10-Ks                         (20)
Disqualified because 10-K was not available from the EDGAR DB            (10)
Disqualified because 10-K did not contain management discussion section  (10)
Final count of qualifying fraudulent 10-Ks used in the sample            101
Sample Selection—Types of Fraud
Type of Fraud                                                 Companies
Overstatement of revenues                                     44
Combination of overstating revenue and understating expenses  25
Disclosure issue                                              10
Overstatement of inventory                                    6
Other income increasing effects                               6
Understatement of provisions for loan-loss reserves           5
Other                                                         5
Sample Selection – Non-Fraudulent sample
• 101 Matching Non-Fraudulent 10-Ks were identified
Lexical Bundle Identification
• 560 Lexical Bundles were identified
Creative Accounting
Lexical Bundle                        Fraud Bundles Per Million Words  NonFraud Bundles Per Million Words  % difference
in process research and development   199                              76                                  160%
goodwill and other intangible assets  121                              82                                  47%
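The per-million-word rates in the table normalize raw bundle counts by corpus size so that the fraudulent and non-fraudulent samples can be compared directly. A minimal sketch of that normalization follows. The function names and the counts in the usage lines are hypothetical, and the slides do not state the exact formula behind the % difference column, so the helper below (gap divided by the lower rate) is an assumption. The table presumably shows rounded rates, so this helper should not be expected to reproduce the published percentages exactly.

```python
def rate_per_million(bundle_count, total_words):
    """Occurrences of a bundle per one million words of text."""
    return bundle_count / total_words * 1_000_000

def pct_difference(rate_a, rate_b):
    """Assumed formula: gap between the two rates relative to the lower rate."""
    higher, lower = max(rate_a, rate_b), min(rate_a, rate_b)
    return (higher - lower) / lower * 100

# Hypothetical counts: 50 occurrences of a bundle in a 250,000-word sample.
example_rate = rate_per_million(50, 250_000)
# Hypothetical rates: one sample uses the bundle twice as often as the other.
example_gap = pct_difference(200.0, 100.0)
```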
Big Bath Charges
• Wholesale aggressive restructuring to improve
cost and expense structure for the future
– Disposition of long-lived assets
Lexical Bundle                        Fraud Bundles Per Million Words  NonFraud Bundles Per Million Words  % difference
disposition of long lived assets and  49                               21                                  139%
Fair Value Accounting
• Subjective method for assigning value to an asset
– Change value of assets
– Understate debt obligations
– Misrepresent foreign currency exchange adjustments
Lexical Bundle                        Fraud Bundles Per Million Words  NonFraud Bundles Per Million Words  % difference
the fair value of                     257                              171                                 50%
in foreign currency exchange          41                               21                                  97%
Lexical Bundles Used More Frequently in Non-Fraudulent MD&As
• Conservative language for accounting practices
Lexical Bundle                        Fraud Bundles Per Million Words  NonFraud Bundles Per Million Words  % difference
to continue as a going concern        15                               91                                  513%
disclosures about market risk         85                               115                                 36%
material impact on the                38                               52                                  35%
Principal Component Analysis
• Variable reduction procedure
– Combines correlated variables into principal components
• Principal components
– First component accounts for maximum amount of total variance in the observed variables
– Components are uncorrelated
• Components are made up of correlated variables
– Overlapping lexical bundles are combined
Correlated bundles transformed into one principal component
4-word bundles: there can be no; can be no assurance; be no assurance that
5-word overlaps: there can be no assurance; can be no assurance that
6-word component: there can be no assurance that
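The way overlapping 4-word bundles collapse into one longer phrase can be sketched as a simple string merge. This is only the display step that joins overlapping word sequences, not principal component analysis itself, and the helper names `merge_overlapping` and `merge_chain` are hypothetical.

```python
def merge_overlapping(first, second):
    """Join two phrases on their longest word overlap, or return None."""
    a, b = first.split(), second.split()
    # Try the longest possible overlap first, then shorter ones.
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return " ".join(a + b[k:])
    return None

def merge_chain(bundles):
    """Merge an ordered list of overlapping bundles into one phrase."""
    merged = bundles[0]
    for nxt in bundles[1:]:
        result = merge_overlapping(merged, nxt)
        if result is None:
            raise ValueError(f"no overlap between {merged!r} and {nxt!r}")
        merged = result
    return merged

component_phrase = merge_chain(
    ["there can be no", "can be no assurance", "be no assurance that"]
)
```

Applied to the three 4-word bundles from the slide, the chain yields the 6-word phrase "there can be no assurance that".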
Principal Component Analysis
• 560 Lexical Bundles were reduced to 88 principal components
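To make the variable-reduction idea concrete, the sketch below finds the first principal component of a toy bundle-rate matrix using power iteration on the covariance matrix. The slides do not say which algorithm or software the study used, so power iteration is a stand-in chosen for brevity; the data and the function names are hypothetical, and a real analysis would normally rely on a statistics package.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of the columns of a documents-by-bundles matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(rows, iterations=200):
    """Loadings of the first principal component via power iteration."""
    cov = covariance_matrix(rows)
    p = len(cov)
    # Start from an all-equal unit vector; this can fail if it happens to be
    # orthogonal to the leading eigenvector, which is not the case here.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

# Hypothetical rates for three bundles in four documents. The first two
# columns move together (like overlapping bundles); the third never varies.
toy_rates = [[1, 1, 2], [2, 2, 2], [3, 3, 2], [4, 4, 2]]
loadings = first_component(toy_rates)
```

The two perfectly correlated columns receive equal loadings and the constant column receives none, which is the sense in which correlated bundles end up in the same component.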
Component 1
principles generally accepted in
accounting principles generally accepted
generally accepted in the
accepted in the united
with accounting principles generally
affect the reported amounts
reported amounts of assets
that affect the reported
to make estimates and
factors that could cause
actual results to differ
results to differ materially
of assets and liabilities
actual results may differ
to differ materially from
differ materially from those
forward looking statements this
in the united states
allowance for doubtful accounts
are expected to be
company believes that the
Component 1
with accounting principles generally accepted in the united states
that affect the reported amounts of assets and liabilities
are expected to be
company believes that the
to make estimates and
factors that could cause
forward looking statements this
allowance for doubtful accounts
actual results to / may differ materially from those
“GAAP and expected results”
Component 2
have a material adverse
material adverse effect on
a material adverse effect
adverse effect on the
business financial condition and
could have a material
effect on the company's
can be no assurance
be no assurance that
there can be no
assurance that the company
of one or more
the company will be
no assurance that the
of the company's products
that the company will
and will continue to
Component 2
could have a material adverse effect on the company's
there can be no assurance that the company will be
business financial condition and
of one or more
of the company's products
and will continue to
“Could be bad”
Classification Results
• Discriminant Analysis
– 71% of cross-validated cases were correctly classified
Discriminating factor (PC)               Beta
Impact and exposure                      .464
Material difference                      -.421
Common stock and adverse effects         .412
Going concerns                           .363
New product introductions                .339
Price and offsets                        .335
COGS and change in accounting principle  .330
Fair market value                        .313
Exercise of stock options                .298
Number of Factors                        -.287
Confusion Matrix
                        Predicted Fraudulent  Predicted Non-Fraudulent
Actual Fraudulent       70                    31
Actual Non-Fraudulent   28                    73
Confusion Matrix Results
$\text{Precision} = \frac{TP}{TP + FP} = .714$
$\text{True Positive Rate (TPR)} = \frac{TP}{TP + FN} = .693$
$\text{False Positive Rate (FPR)} = \frac{FP}{FP + TN} = .277$
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = .708$
                        Predicted Fraudulent  Predicted Non-Fraudulent
Actual Fraudulent       70 (TP)               31 (FN)
Actual Non-Fraudulent   28 (FP)               73 (TN)
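The four measures can be recomputed from the confusion matrix counts with a few lines of Python. The function name `confusion_metrics` is hypothetical; the counts are the ones reported on the slide.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard rates derived from a 2x2 confusion matrix."""
    return {
        "precision": tp / (tp + fp),
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Counts from the slide: 70 TP, 31 FN, 28 FP, 73 TN.
metrics = confusion_metrics(tp=70, fn=31, fp=28, tn=73)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

Rounded to three decimals, the results agree with the values on the slide, and the accuracy of .708 corresponds to the 71% of cross-validated cases reported as correctly classified.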
Conclusion
• Lexical bundles have more contextual meaning than unigrams
– Results are easier to interpret
• Lexical bundles may be used to classify documents
• Lexical bundle analysis can be applied to any type of textual dataset
• This process and other text mining processes can be integrated into
continuous assurance solutions
– Rapid identification of suspicious documents