Words that Matter: Application of Text Analytics

Uploaded by miguel-angel-castillo-cpa-crma-cgfm-cisa on 09-Jan-2017

TRANSCRIPT

Page 1: Words that Matter

Words that Matter: Application of Text Analytics

Page 2: Words that Matter

Topics

• Business Questions
• Success Strategy
• Project Steps
• Technical Solution
• Analytic Requirements
• Results
• Business Application
• Lessons Learned

Page 3: Words that Matter

Business Questions

• How well has the Office of the Inspector General (OIG) fulfilled its mission?
• How can the OIG prioritize final rule reviews?
  • Did common terms in public comments appear in final rules?
  • What sentiment did public comments express?

Page 4: Words that Matter

Success Strategy

Sizing the Project
• Data – Available, Processable, Standardized
• Security Concerns – factor in information security governance

Seeking an Executive Champion
• Do they support the value of the answer?
• To what extent will they fund the project (budgetary considerations)?

Repeating a Quick Win
• Is the project repeatable to gain support for subsequent projects?

Page 5: Words that Matter

Project Steps

• Engaged management buy-in for questions
• Assessed security concerns for public-facing data
• Contracted technical support and quantitative and qualitative statistical expertise
• Used Amazon Web Services for infrastructure support
• Used Amazon Marketplace for selecting a text mining tool
• Documented repeatable technical tasks

Page 6: Words that Matter

Technical Solution


Presenter
Presentation Notes
Good morning, and welcome to the PAWGOV conference. My name is Antuane Allen. I’m an analyst with Sanametrix, a certified AWS partner contracted to provide technical guidance. To address the business questions, two solutions were utilized: MarkLogic, from within the AWS Marketplace, and IBM AlchemyAPI, from outside it. Using Amazon Web Services to host a MarkLogic software implementation, together with the AlchemyAPI service from IBM Bluemix, large disconnected data sets were processed via the REST API endpoints available from each respective service.
Page 7: Words that Matter

Analytic Requirement #1

MarkLogic – platform enabled the ability to parse unstructured text and calculate term frequencies

Term Frequency Normalization – where w_t is the count of term t and N is the total number of terms within a document or set of documents:

tf(t) = w_t / N

Gap Concept – differences between normalized frequencies of baseline terms and corpus documents

Presenter
Presentation Notes
A challenge in addressing the business question was how to take unstructured text data and structure it in a format useful for drawing insight. Business question 2a asks, “Did common terms in public comments appear in final rules?” To address this, an application was needed that could calculate word counts on text data and work with different document formats. After evaluating several options, the MarkLogic software was determined to be the appropriate solution. Using MarkLogic, each final rule document was parsed, and a baseline set of terms was compiled from the most frequently occurring terms. Stop terms (prepositions and other terms that occur with high frequency but carry no semantic importance) were excluded from the calculation. Each baseline term has an associated term frequency (TF) in the document. The next step was refining the baseline: after the baseline set was compiled, terms were evaluated for their importance and either kept or removed at the discretion of the subject matter experts.
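The term-frequency and gap computations described above can be sketched as follows. This is a minimal illustration, not the OIG's actual pipeline: the stop list is a small placeholder, and the sign convention (corpus minus baseline) is an assumption.

```python
from collections import Counter
import re

# Illustrative stop list; the real analysis excluded a fuller set of
# prepositions and other high-frequency, low-meaning terms.
STOP_TERMS = {"the", "of", "and", "to", "in", "a", "an", "for", "that"}

def term_frequencies(text):
    """Normalized term frequency: tf(t) = w_t / N, where w_t is the
    count of term t and N is the total number of non-stop terms."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_TERMS]
    counts = Counter(tokens)
    n = len(tokens)
    return {term: count / n for term, count in counts.items()}

def gaps(baseline_tf, corpus_tf, baseline_terms):
    """Gap concept: difference between the normalized frequency of each
    baseline term in the corpus and in the baseline documents.
    A negative gap means the term is relatively rarer in the corpus."""
    return {t: corpus_tf.get(t, 0.0) - baseline_tf.get(t, 0.0)
            for t in baseline_terms}
```

Because both frequencies are normalized by document length, gaps are comparable across documents of very different sizes.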
Page 8: Words that Matter

OIG Standards of Work

Business Question: How well has the Office of the Inspector General (OIG) fulfilled its mission?

Answer: OIG could improve its standards of audit work.

[Chart: gap (−0.04 to 0.10) between baseline terms and OIG corpus term frequencies, by baseline term]

Presenter
Presentation Notes
The objective of the Audit Strategic Planning Model is to examine the gap between the CFTC OIG strategic objectives and its current focus on those objectives, as measured by its previous OIG audit report topics. These topics are expressed through the Office of Data and Technology (ODT) selected corpus of unstructured, public-facing, document-based material provided for this report (www.cftc.gov).
Page 9: Words that Matter

OIG Mission Results

[Chart: Audit Mission Term Gap Analysis – gap (−0.015 to 0.01) by baseline term, with “risk” highlighted]

“Risk” stood out among the key mission terms. This suggests that the OIG generally balances its workload to meet its mission. Since “risk” is typically associated with “control” work, the OIG should either emphasize more internal control work or emphasize the impact of that work.

Presenter
Presentation Notes
Of the seven key terms associated with the audit strategic mission, only one of the terms, “risk,” had a term frequency that was greater in the baseline documents than in OIG corpus documents. “Efficiency” had the greatest increase in term frequency proportion from the baseline to the OIG corpus, followed by “economy,” “waste,” “abuse,” “effect/effectiveness,” and “fraud.”
Page 10: Words that Matter

Strategic Planning Application

• Utilize TeamMate software to standardize audit planning and execution
• Emphasize internal control risks at project start
• Emphasize the impact associated with the business question

Page 11: Words that Matter

Business Question: Did common terms in public comments appear in final rules?

Answer: Yes, with varying degrees of intensity, enabling differentiation.

Rule Review Results

[Chart: Gap Distribution (0–100%) by rule: 75FR55410, 76FR41398, 76FR43851, 76FR53172, 76FR71626, 76FR80674, 77FR20128, 77FR30596, 77FR42559, 81FR636; bands: Gap ≥ +1%, Gap ≤ −1%, −1% < Gap < 1%]

Presenter
Presentation Notes
Generally, most rules had at least 70% of key terms appearing within ±1% from the rule document to the public comments. Seven of the 10 rules had a higher percentage of key terms appearing more frequently in the comments than in the rule. Only two rules had a higher percentage of key terms appearing less frequently. One rule had key terms appear proportionally (no difference) between the rule and the public comments.
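The ±1% banding behind the gap-distribution chart can be sketched as below. The band labels and function names are mine, chosen only to mirror the slide's three bands.

```python
from collections import Counter

BANDS = ("Gap >= +1%", "Gap <= -1%", "-1% < Gap < 1%")

def gap_band(gap):
    """Place one term's gap (a fraction, e.g. 0.015 == 1.5%) into one of
    the three bands shown on the slide."""
    if gap >= 0.01:
        return BANDS[0]
    if gap <= -0.01:
        return BANDS[1]
    return BANDS[2]

def gap_distribution(term_gaps):
    """Share of a rule's key terms falling in each band; these shares are
    what the stacked bars on the slide display per rule."""
    counts = Counter(gap_band(g) for g in term_gaps)
    total = len(term_gaps)
    return {band: counts.get(band, 0) / total for band in BANDS}
```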
Page 12: Words that Matter

Analytic Requirement #2

IBM AlchemyAPI – natural language processing platform, learning algorithm

Scoring Mechanism – Positive, Neutral, Negative

Sentiment Attributes – Mixed Sentiment

Limitations of the Exercise
• Number of available comments for each rule
• Data quality – data capture, PDFs, noise
• Document level vs. entity level
• False positives

Presenter
Presentation Notes
The sentiment analysis algorithm works by looking for positive and negative words and aggregating them to yield an output. The document-level sentiment is output as a score between −1 and +1: a positive score implies positive sentiment, a negative score indicates negative sentiment, and neutral sentiment is scored as zero. Along with the sentiment score, AlchemyAPI also outputs a score for another indicator, called “mixed.” A value of 1 for “mixed” indicates the presence of both positive and negative sentiments in the text.
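As a rough illustration of the scoring behaviour described above (not AlchemyAPI's actual model, which is a trained learning algorithm with far richer lexicons), a minimal word-counting scorer that yields a score in [−1, +1] plus a "mixed" flag might look like:

```python
# Tiny illustrative lexicons; AlchemyAPI's real vocabulary is much larger.
POSITIVE = {"support", "agree", "benefit", "improve", "good"}
NEGATIVE = {"oppose", "harm", "concern", "burden", "bad"}

def score_document(text):
    """Aggregate positive and negative word hits into a document-level
    score between -1 and +1 (0 == neutral), plus a 'mixed' indicator
    set to 1 when both polarities are present in the text."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    score = 0.0 if total == 0 else (pos - neg) / total
    mixed = 1 if pos > 0 and neg > 0 else 0
    return {"score": score, "mixed": mixed}
```

This toy version also makes the "mixed" limitation concrete: almost any comment of moderate length contains both polarities, which is consistent with the finding that nearly all rules had 90% or more comments flagged as mixed.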
Page 13: Words that Matter

Business Question: What sentiment did public comments express?

Answer: The majority of public comments are positive towards proposed rules.

Rule Review Results

[Chart: Sentiment Distribution (0–100%) by Dodd-Frank Rule: 76FR41398, 76FR43851, 76FR53172, 76FR80674, 77FR20128, 77FR30596, 77FR42559, 81FR636, 76FR71626; legend: positive, negative, neutral]

Presenter
Presentation Notes
Of the 10 rules, two (81FR636 and 75FR55410) did not have enough scored comments to confidently rely on the results for analysis. 75FR55410 was processed with errors by AlchemyAPI because many comments were in an unreadable PDF format. Six of the other eight had 50% or more of their comments scored as positive (77FR30596, 77FR20128, 76FR80674, 76FR41398, 76FR43851, 76FR71626). Rule 76FR71626 had the most comments, and overall 68.4% of the 13,782 comments analyzed were scored positive. An independent qualitative review was conducted in which subject matter experts sampled over 1,100 comments from 76FR71626 and found 90% of the comments scored by AlchemyAPI to be accurate. The rule with the highest percentage of negative comments was 77FR42559, with 79.3% of its 1,389 comments scored as negative, followed by 76FR43851, with 44.9% of its 1,135 comments scored as negative. Nearly all rules had 90% or more comments with mixed sentiment.
Page 14: Words that Matter

Strategic Planning Application

Text mining tools, with some limitations, are useful in prioritizing OIG reviews of final rules.

Three rules in the negative quadrants should be considered for further study.

Sentiment × Term Frequency Gap matrix:

• Positive sentiment, positive gap: 76FR41398, 76FR43851, 76FR71626, 77FR30596
• Positive sentiment, negative gap: 77FR20128
• Negative sentiment, positive gap: 77FR42559, 76FR53172
• No sentiment results: 75FR55410 (positive gap), 81FR636 (negative gap)
• 76FR80674: positive sentiment, equal positive and negative gaps
Presenter
Presentation Notes
The matrix displays the intersections between positive/negative sentiment and positive/negative term frequency gaps for each Dodd-Frank rule used in this analysis. If a rule is listed in the negative “term frequency gap” column, then a higher proportion of its key terms were mentioned less frequently in the public comments; if it is listed in the positive column, then a higher proportion of its key terms were mentioned more frequently. For sentiment, a rule in the positive row had a higher proportion of positive comments, and a rule in the negative row had a higher proportion of negative comments. Four rules had both positive sentiment and an overall higher proportion of key terms appearing more frequently within their respective public comments. Two rules had more negative sentiment and an overall higher proportion of key terms appearing more frequently. One rule had more positive sentiment and an overall higher proportion of key terms appearing less frequently. Rule 75FR55410 did not have results for sentiment but did have an overall higher proportion of key terms appearing more frequently in the public comments. Rule 81FR636 also did not have results for sentiment; however, it had a higher proportion of key terms appearing less frequently. Rule 76FR80674 had overall positive sentiment but an equal percentage of terms with positive and negative gaps.
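The quadrant assignment walked through above can be sketched as follows. The function name and the majority thresholds are my assumptions about how each rule was placed; the slide does not state exact cutoffs.

```python
def matrix_cell(pos_comment_share, pos_gap_share, enough_comments=True):
    """Place one rule into the sentiment x term-frequency-gap matrix.

    pos_comment_share: fraction of scored comments that were positive
    (ignored when too few comments were scored, as for 81FR636 and
    75FR55410).
    pos_gap_share: fraction of key terms with a positive gap."""
    if not enough_comments:
        row = "no sentiment"
    elif pos_comment_share > 0.5:
        row = "positive sentiment"
    else:
        row = "negative sentiment"
    if pos_gap_share > 0.5:
        col = "positive gap"
    elif pos_gap_share < 0.5:
        col = "negative gap"
    else:
        col = "equal"  # e.g. 76FR80674 on the slide
    return (row, col)
```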
Page 15: Words that Matter

Lessons Learned—Success Strategy

Sizing the Project
• Data – Available, Processable, Standardized
• Security Concerns – factor in information security governance

Seeking an Executive Champion
• Do they support the value of the answer?
• To what extent will they fund the project (budgetary considerations)?

Repeating a Quick Win
• Is the project repeatable to gain support for subsequent projects?