Text Analytics Workshop Development Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com

Post on 21-Dec-2015

Page 1

Text Analytics Workshop Development

Tom Reamy, Chief Knowledge Architect

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com

Page 2

Agenda

Development - Foundation

Case Study 1 – Internet News
Case Study 2 – Tale of Two Taxonomies
Case Study 3 – Software Evaluation and Beyond
Exercises

Page 3

Text Analytics Development: Foundation

Articulated Information Management Strategy (K Map)
– Content, structures, and metadata
– Search, ECM, applications, and how they are used in the enterprise
– Community information needs and Text Analytics team

POC establishes the preliminary foundation
– Need to expand and deepen
– Content – full range, basis for rules and training
– Additional SMEs – content selection, refinement

Taxonomy – starting point for categorization / suitable?
Databases – starting point for entity catalogs

Page 4

Knowledge Architecture Audit: Knowledge Map

[Process diagram; labels in extraction order, one diagram row per line]
Project Foundation; Contextual Interviews; Information Interviews; App/Content Catalog; User Survey; Strategy Document
Meetings, work groups; Overview, High Level: Process, Community; Info behaviors of business processes; Technology and content; All 4 dimensions; Meetings, work groups
General Outline; Broad Context; Deep Details; Deep Details; Complete Picture; New Foundation

Page 5

Taxonomy Development Process: Progressive Refinement

[Process diagram; labels in extraction order, one diagram row per line]
Taxonomy Model; Information Interviews; Content Analysis; Refine Map Community; Governance Plan
Buy/Find work groups; Overview; Info behaviors, Card Sorts; Bottom Up Prototypes; Interviews Evaluate; Refine Interviews; Develop, Refine
General Outline; Preliminary Taxonomy; Taxonomy 1.0; Taxonomy 1.0-1.9; Tax 2.0 Taxonomy

Page 6

Text Analytics Development: Categorization Process

Starter Taxonomy
– If no taxonomy, develop initial high level (see Chart)

Analysis of taxonomy – suitable for categorization
– Structure – not too flat, not too large
– Orthogonal categories

Content Selection
– Map of all anticipated content
– Selection of training sets – if possible
– Automated selection of training sets – taxonomy nodes as first categorization rules – apply and get content
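The automated selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the taxonomy nodes, seed terms, and documents are hypothetical, and a real categorization tool would use its own rule syntax rather than plain word matching.

```python
from collections import defaultdict

# Hypothetical starter taxonomy: node label -> seed terms used as the first rule.
TAXONOMY = {
    "Healthcare": ["healthcare", "hospital"],
    "Travel": ["travel", "airline"],
}


def select_training_sets(documents, taxonomy):
    """Assign each document to every node whose seed terms it mentions."""
    training_sets = defaultdict(list)
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        for node, seed_terms in taxonomy.items():
            if words & set(seed_terms):
                training_sets[node].append(doc_id)
    return dict(training_sets)


# Hypothetical content sample.
documents = {
    "d1": "New hospital opens downtown",
    "d2": "Airline adds travel routes to Asia",
    "d3": "Quarterly earnings beat forecasts",
}
candidate_sets = select_training_sets(documents, TAXONOMY)
```

Documents that match no node (here `d3`) stay out of every candidate set, which is a useful signal that the starter taxonomy or its seed terms do not yet cover that content.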

Page 7

Text Analytics Development: Categorization Process

First Round of Categorization Rules
Term building – from content – basic set of terms that appear often / important to content
Add terms to rule, apply to broader set of content
Repeat for more terms – get recall-precision “scores”
Repeat, refine, repeat, refine, repeat
Get SME feedback – formal process – scoring
Get SME feedback – human judgments
Test against more, new content
Repeat until “done” – 90%?
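One way to get the recall-precision "scores" for each refinement round is to run the current rule against a set of SME-judged documents. The sketch below assumes a rule is simply a list of terms that fires when any term appears; the function names and sample documents are hypothetical.

```python
def rule_matches(text, terms):
    """A rule fires when any of its terms appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def score_rule(terms, labeled_docs):
    """Return (precision, recall) of a term rule against SME judgments.

    labeled_docs is a list of (text, is_relevant) pairs.
    """
    tp = fp = fn = 0
    for text, is_relevant in labeled_docs:
        fired = rule_matches(text, terms)
        if fired and is_relevant:
            tp += 1
        elif fired and not is_relevant:
            fp += 1
        elif not fired and is_relevant:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical SME-judged sample for a "Healthcare" category.
labeled = [
    ("Hospital merger approved by regulators", True),
    ("New drug trial shows promise", True),
    ("Hospital-themed restaurant opens", False),
    ("Stock markets rally", False),
]
round1 = score_rule(["hospital"], labeled)
round2 = score_rule(["hospital", "drug trial"], labeled)
```

Adding a term in the second round raises recall, while the false hit on the restaurant story remains and would need a further refinement such as an exclusion term.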

Page 8

Text Analytics Development: Entity Extraction Process

Facet Design – from KA Audit, K Map
Find and Convert catalogs:
– Organization – internal resources
– People – corporate yellow pages, HR
– Include variants
– Scripts to convert catalogs – programming resource

Build initial rules – follow categorization process
– Differences – scale, “score”
– Recall – find all entities
– Precision – correct assignment to entity class
– Issue – disambiguation – Ford company, person, car
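A catalog-conversion script and a simple disambiguation step might look like the following sketch. Everything here is assumed for illustration: the CSV layout, the catalog rows, and the context-cue lists are hypothetical, and commercial entity extraction tools have their own catalog formats and rule languages.

```python
import csv
import io
import re

# Hypothetical catalog export: one row per entity, variants separated by "|".
CATALOG_CSV = """name,entity_class,variants
Ford Motor Company,Organization,Ford|Ford Motor Co.
Gerald Ford,Person,President Ford|Ford
Ford Mustang,Product,Mustang|Ford
"""


def load_catalog(csv_text):
    """Convert catalog rows into a lookup: surface form -> list of (name, class)."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        forms = [row["name"]] + [v for v in row["variants"].split("|") if v]
        for form in forms:
            lookup.setdefault(form.lower(), []).append(
                (row["name"], row["entity_class"])
            )
    return lookup


# Hypothetical cue words that hint at an entity class in the surrounding text.
CONTEXT_CUES = {
    "Organization": {"shares", "automaker", "plant"},
    "Person": {"president", "said", "signed"},
    "Product": {"drove", "model", "horsepower"},
}


def resolve(form, sentence, lookup):
    """Return the catalog entry for a surface form, using cues when ambiguous."""
    candidates = lookup.get(form.lower(), [])
    if len(candidates) <= 1:
        return candidates[0] if candidates else None
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    # Pick the candidate whose class shares the most cue words with the sentence;
    # on a tie, the first catalog entry wins.
    return max(candidates, key=lambda c: len(words & CONTEXT_CUES.get(c[1], set())))


lookup = load_catalog(CATALOG_CSV)
company = resolve("Ford", "Ford shares rose after the automaker reported earnings", lookup)
person = resolve("Ford", "President Ford said he would veto the bill", lookup)
```

Loading variants into the same lookup addresses recall (finding all mentions), while the cue-based choice among candidates addresses precision (assigning the right entity class).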

Page 9

Case Study - Background

Inxight Smart Discovery
Multiple Taxonomies
– Healthcare – first target
– Travel, Media, Education, Business, Consumer Goods,

Content – 800+ Internet news sources
– 5,000 stories a day

Application – Newsletters
– Editors using categorized results
– Easier than full automation

Page 10

Case Study - Approach

Initial High Level Taxonomy
– Auto generation – very strange – not usable
– Editors High Level – sections of newsletters
– Editors and taxonomy pros – broad categories and refine

Develop Categorization Rules
– Multiple test collections
– Good stories, bad stories – close misses – terms

Recall and Precision Cycles
– Refine and test – taxonomists – many rounds
– Review – editors – 2-3 rounds

Repeat – about 4 weeks
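Comparing good stories with close misses is one way to find terms worth adding to a rule. The sketch below is an assumption about how that comparison could be automated, not a description of the tool used in the case study; the function names, threshold, and sample stories are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a story and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(good_stories, bad_stories, min_good=2):
    """Terms found in at least `min_good` good stories and in no bad story."""
    good_df = Counter(t for s in good_stories for t in set(tokenize(s)))
    bad_df = Counter(t for s in bad_stories for t in set(tokenize(s)))
    return sorted(
        t for t, n in good_df.items() if n >= min_good and bad_df[t] == 0
    )


# Hypothetical test collections for a "Healthcare" category.
good = [
    "Hospital reports patient outcomes",
    "Patient care improves at hospital",
]
bad = ["Animal hospital adopts kittens"]  # a close miss: shared term, wrong topic
terms = candidate_terms(good, bad)
```

Here "hospital" is dropped because it also appears in the close miss, leaving "patient" as the more discriminating candidate for the next round of rule refinement.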

Page 18

Case Study - Issues

Taxonomy Structure
– Aggregate nodes vs. independent nodes
– Children nodes – subset – rare

Depth of taxonomy and complexity of rules
– Trade-off between the need to update and the usefulness of categories

Multiple avenues – Facets – source – New York Times – can put into rules or make it a facet to filter results
When to use filter or terms – experimental
Recall more important than precision – editors' role
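The two avenues for handling source can be contrasted in a short sketch. The `Story` type, the topic rule, and the sample stories are hypothetical and stand in for whatever the categorization software provides.

```python
from dataclasses import dataclass


@dataclass
class Story:
    source: str
    text: str


def topic_rule(story):
    """Hypothetical topic rule: fires on stories that mention a merger."""
    return "merger" in story.text.lower()


def rule_with_source(story):
    """Option A: bake the source into the categorization rule itself."""
    return topic_rule(story) and story.source == "New York Times"


def categorize_then_filter(stories, source=None):
    """Option B: categorize by topic only, then filter on the source facet."""
    hits = [s for s in stories if topic_rule(s)]
    return [s for s in hits if source is None or s.source == source]


stories = [
    Story("New York Times", "Bank merger announced"),
    Story("Reuters", "Airline merger stalls"),
    Story("New York Times", "Weather update"),
]
in_rule = [s.text for s in stories if rule_with_source(s)]
all_hits = [s.text for s in categorize_then_filter(stories)]
filtered = [s.text for s in categorize_then_filter(stories, source="New York Times")]
```

Both options return the same stories when the filter is applied, but the facet version keeps the topic rule reusable and leaves the choice of source to the editor, which fits a workflow where recall matters more than precision.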

Page 19

Case Study – Lessons Learned

Combination of SME and Taxonomy pros
Combination of Features – Entity extraction, terms, Boolean, filters, facts
Training sets and find similar are weakest
– Somewhat useful during development for terms

No best answer – taxonomy structure, format of rules
– Need custom development

Plan for ongoing refinement
This stuff actually works!

Page 20

Enterprise Environment – Case Studies

A Tale of Two Taxonomies – It was the best of times, it was the worst of times

Basic Approach
– Initial meetings – project planning
– High level K map – content, people, technology
– Contextual and Information Interviews
– Content Analysis
– Draft Taxonomy – validation interviews, refine
– Integration and Governance Plans

Page 21

Enterprise Environment – Case One – Taxonomy, 7 facets

Taxonomy of Subjects / Disciplines:
– Science > Marine Science > Marine microbiology > Marine toxins

Facets:
– Organization > Division > Group
– Clients > Federal > EPA
– Instruments > Environmental Testing > Ocean Analysis > Vehicle
– Facilities > Division > Location > Building X
– Methods > Social > Population Study
– Materials > Compounds > Chemicals
– Content Type – Knowledge Asset > Proposals
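A subject taxonomy plus facets can be modeled as one hierarchical path per dimension. The sketch below is a hypothetical data structure for illustration only; the class name, helper functions, and document titles are assumptions, while the paths come from the examples above.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedDocument:
    title: str
    subject: tuple  # taxonomy path, broadest term first
    facets: dict = field(default_factory=dict)  # facet name -> path tuple


def under(path, prefix):
    """True when `path` equals `prefix` or sits below it in the hierarchy."""
    return path[:len(prefix)] == prefix


def search(docs, subject_prefix=(), facet_prefixes=None):
    """Return documents under a subject node that also match every facet prefix."""
    facet_prefixes = facet_prefixes or {}
    results = []
    for doc in docs:
        if not under(doc.subject, tuple(subject_prefix)):
            continue
        if all(
            under(doc.facets.get(name, ()), tuple(prefix))
            for name, prefix in facet_prefixes.items()
        ):
            results.append(doc)
    return results


docs = [
    TaggedDocument(
        "Toxin survey proposal",  # hypothetical title
        ("Science", "Marine Science", "Marine microbiology", "Marine toxins"),
        {"Clients": ("Federal", "EPA"),
         "Content Type": ("Knowledge Asset", "Proposals")},
    ),
    TaggedDocument(
        "Population study notes",  # hypothetical title
        ("Science", "Marine Science"),
        {"Methods": ("Social", "Population Study")},
    ),
]
marine = search(docs, ("Science", "Marine Science"))
federal = search(docs, ("Science",), {"Clients": ("Federal",)})
```

Keeping each facet as its own short hierarchy lets a search combine a broad subject node with a narrow facet value, instead of forcing every combination into one deep taxonomy.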

Page 22

Enterprise Environment – Case One – Taxonomy, 7 facets

Project Owner – KM department – included RM, business process

Involvement of library – critical
Realistic budget, flexible project plan
Successful interviews – build on context
– Overall information strategy – where taxonomy fits

Good draft taxonomy and extended refinement
– Software, process, team – train library staff
– Good selection and number of facets

Final plans and hand off to client

Page 23

Enterprise Environment – Case Two – Taxonomy, 4 facets

Taxonomy of Subjects / Disciplines:
– Geology > Petrology

Facets:
– Organization > Division > Group
– Process > Drill a Well > File Test Plan
– Assets > Platforms > Platform A
– Content Type > Communication > Presentations

Page 24

Enterprise Environment – Case Two – Taxonomy, 4 facets

Environment Issues
– Value of taxonomy understood, but not the complexity and scope
– Under budget, under staffed
– Location – not KM – tied to RM and software
• Solution looking for the right problem
– Importance of an internal library staff
– Difficulty of merging internal expertise and taxonomy

Page 25

Enterprise Environment – Case Two – Taxonomy, 4 facets

Project Issues
– Project mind set – not infrastructure
– Wrong kind of project management
• Special needs of a taxonomy project
• Importance of integration – with team, company
– Project plan more important than results
• Rushing to meet deadlines doesn’t work with semantics as well as software

Page 26

Enterprise Environment – Case Two – Taxonomy, 4 facets

Research Issues
– Not enough research – and wrong people
– Interference of non-taxonomy – communication
– Misunderstanding of research – wanted tinker toy connections
• Interview 1 implies conclusion A

Design Issues
– Not enough facets
– Wrong set of facets – business not information
– Ill-defined facets – too complex internal structure

Page 27

Taxonomy Development Conclusion: Risk Factors

Political-Cultural-Semantic Environment
– Not simple resistance – more subtle
• Re-interpretation of specific conclusions and sequence of conclusions / relative importance of specific recommendations

Understanding project scope
Access to content and people
– Enthusiastic access

Importance of a unified project team
– Working communication as well as weekly meetings

Page 28

Text Analytics Development: Case Study 3 – POC – Government Agency

Demo of SAS – Teragram / Enterprise Content Categorization

Page 29

Conclusion

Enterprise Context – strategic, self knowledge
Importance of a good foundation
– Importance of Taxonomy Structure – mapped to use
– POC a head start on development

Importance of Text Analytics Vision / Strategy
– Infrastructure resource, not a project

Balance of expertise and local knowledge
Importance of Usability for refinement cycles
Difference of taxonomy and categorization
– Concepts vs. text in documents

Page 30

Questions?

Tom Reamy [email protected]

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com