Semantic Infrastructure Workshop Development Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com

Post on 19-Dec-2015



Semantic Infrastructure Workshop Development

Tom Reamy

Chief Knowledge Architect

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com


Agenda

Text Analytics – Foundation
  – Features and Capabilities
Evaluation of Text Analytics
  – Start with Self-Knowledge
  – Features and Capabilities
  – Filter, Proof of Concept / Pilot
Text Analytics Development
  – Progressive Refinement
  – Categorization, Extraction, Sentiment
  – Case Studies – Best Practices


Semantic Infrastructure - Foundation
Text Analytics Features

Noun Phrase Extraction
  – Catalogs with variants, rule based dynamic
  – Multiple types, custom classes – entities, concepts, events
  – Feeds facets
Summarization
  – Customizable rules, map to different content
Fact Extraction
  – Relationships of entities – people-organizations-activities
  – Ontologies – triples, RDF, etc.
Sentiment Analysis
  – Rules – Objects and phrases – positive and negative
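As a rough illustration of "catalogs with variants", the sketch below looks up known variants of an entity in text and maps each hit back to a canonical name and entity type. The catalog contents and the function name `extract_entities` are assumptions made for this example, not part of any vendor product.

```python
import re

# Hypothetical catalog: canonical name -> (entity type, known variants).
CATALOG = {
    "Environmental Protection Agency": (
        "organization",
        ["EPA", "Environmental Protection Agency"],
    ),
    "International Business Machines": (
        "organization",
        ["IBM", "International Business Machines"],
    ),
}


def extract_entities(text, catalog=CATALOG):
    """Return sorted (canonical name, type, matched variant) tuples found in text."""
    found = set()
    for canonical, (entity_type, variants) in catalog.items():
        for variant in variants:
            # Whole-word, case-sensitive match, so "EPA" is not found
            # inside a longer token.
            if re.search(r"\b" + re.escape(variant) + r"\b", text):
                found.add((canonical, entity_type, variant))
    return sorted(found)
```

The extracted canonical names are what would feed a facet; a real catalog would also need rules for disambiguation, which this sketch does not attempt.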


Semantic Infrastructure - Foundation
Text Analytics Features

Auto-categorization
  – Training sets – Bayesian, Vector space
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (Title, body, url)
  – Semantic Network – Predefined relationships, sets of rules
  – Boolean
  – Full search syntax – AND, OR, NOT
  – Advanced – NEAR (#), PARAGRAPH, SENTENCE
This is the most difficult to develop
Build on a Taxonomy
Combine with Extraction
  – If any of list of entities and other words
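The Boolean and NEAR operators listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the category, the rule terms, and the function names are hypothetical, and real products use their own rule syntax.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(tokens, a, b, distance):
    """True if terms a and b occur within `distance` tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def billing_dispute_rule(text):
    """Hypothetical rule: (bill NEAR(5) wrong OR overcharged) AND NOT refunded."""
    tokens = tokenize(text)
    complaint = near(tokens, "bill", "wrong", 5) or "overcharged" in tokens
    return complaint and "refunded" not in tokens
```

A proximity operator is what separates a rule like this from plain keyword matching: both terms must appear close together, not merely somewhere in the same document.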


Evaluating Text Analytics Software
Start with Self Knowledge

Strategic and Business Context
Info Problems – what, how severe
Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it
Formal Process – KA audit – content, users, technology, business and information behaviors, applications – or informal for a smaller organization
Text Analytics Strategy/Model – forms, technology, people
  – Existing taxonomic resources, software
Need this foundation to evaluate and to develop


Evaluating Text Analytics Software
Start with Self Knowledge

Do you need it – and what blend if so?
Taxonomy Management Stand alone
  – Multiple taxonomies, languages, authors-editors
Technology Environment – ECM, Enterprise Search – where is it embedded
Publishing Process – where and how is metadata being added – now and projected future
  – Can it utilize auto-categorization, entity extraction, summarization
Is the current search adequate – can it utilize text analytics?
Applications – text mining, BI, CI, Alerts?


Evaluating Text Analytics Software
Team - Interdisciplinary

IT – Large software purchase, needs assessment
  • Text Analytics is different – semantics
  • Construction company designing your house
Business – Understand the business needs
  • Don’t understand information
  • Restaurant owner doing the cooking
Library – know information, search
  • Don’t understand the business, non-information experts
  • Accountant doing financial strategy
Team – combination of consulting and internal


Semantic Infrastructure - Foundation
Design of the Text Analytics Selection Team

Interdisciplinary Team, led by Information Professionals
  – IT – software experience, budget, support tests
  – Business – understand business and requirements
  – Library – understand information structure, understanding of search semantics and functionality
Much more likely to make a good decision
  – This is not a traditional IT software evaluation – semantics
Create the foundation for implementation


Evaluating Text Analytics Software
Evaluation Process & Methodology: Two Phases

Phase I – Traditional Software Evaluation
  – Filter One – Ask Experts – reputation, research – Gartner, etc.
    • Market strength of vendor, platforms, etc.
  – Filter Two – Feature scorecard – minimum, must have, filter to top 3
  – Filter Three – Technology Filter – match to your overall scope and capabilities – Filter not a focus
  – Filter Four – In-Depth Demo – 3-6 vendors
Phase II – Deep POC (2) – advanced, integration, semantics
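The feature scorecard in Filter Two could be sketched as a must-have filter followed by a weighted ranking. The vendor names, feature ratings, and weights below are invented for illustration; they are not the scores from any actual evaluation.

```python
# Hypothetical ratings on a 0-5 scale; 0 means the feature is missing.
VENDORS = [
    {"name": "Vendor A", "features": {"categorization": 5, "extraction": 4, "sentiment": 0}},
    {"name": "Vendor B", "features": {"categorization": 4, "extraction": 4, "sentiment": 3}},
    {"name": "Vendor C", "features": {"categorization": 3, "extraction": 5, "sentiment": 5}},
    {"name": "Vendor D", "features": {"categorization": 5, "extraction": 3, "sentiment": 5}},
    {"name": "Vendor E", "features": {"categorization": 2, "extraction": 2, "sentiment": 2}},
]
MUST_HAVE = ["categorization", "extraction", "sentiment"]
WEIGHTS = {"categorization": 3, "extraction": 2, "sentiment": 1}


def weighted_score(vendor, weights):
    """Sum of weight * rating over all weighted features."""
    return sum(w * vendor["features"].get(f, 0) for f, w in weights.items())


def shortlist(vendors, must_have, weights, top_n=3):
    """Drop vendors missing any must-have feature, then keep the top_n by score."""
    qualified = [
        v for v in vendors
        if all(v["features"].get(f, 0) > 0 for f in must_have)
    ]
    ranked = sorted(qualified, key=lambda v: weighted_score(v, weights), reverse=True)
    return [v["name"] for v in ranked[:top_n]]
```

Applying the must-have test before scoring keeps a vendor with one outstanding feature from ranking highly while lacking a required one.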


Evaluating Text Analytics Software
Phase II - Proof Of Concept - POC

4-6 weeks POC – bake off / or short pilot
Measurable Quality of results is the essential factor
Real life scenarios, categorization with your content
2-3 rounds of development, test, refine / Not OOB
Need SMEs as test evaluators – also to do an initial categorization of content
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time
Taxonomy Developers – expert consultants plus internal taxonomists


Evaluating Text Analytics Software
Phase II – POC: Range of Evaluations

Basic Question – Can this stuff work at all?
Auto-categorization to existing taxonomy – variety of content
  – Essential Issue is complexity of language
Clustering – automatic node generation
Summarization
Entity extraction – build a number of catalogs – design which ones based on projected needs – example privacy info (SS#, phone, etc.)
Entity example – people, organization, methods, etc.
  – Essential issue is scale and disambiguation
Evaluate usability in action by taxonomists
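For the privacy-information example (SS#, phone), a pattern-based catalog might look like the sketch below. It assumes US-style formats (`123-45-6789` for Social Security numbers, `555-867-5309` for phone numbers); the pattern names and the function `find_private_info` are illustrative only.

```python
import re

# Assumed US-style formats; real content would need more variants.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_private_info(text, patterns=PATTERNS):
    """Return a dict mapping each pattern name to the list of matches in text."""
    return {name: pattern.findall(text) for name, pattern in patterns.items()}
```

Patterns like these only flag candidates. As the slide notes, disambiguation is the essential issue: a nine-digit part number can look exactly like an SS#, so surrounding context rules are still needed.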


Text Analytics Evaluation: Case Study
Self-Knowledge

Platform – range of capabilities
  – Categorization, Sentiment analysis, etc.
Technical
  – APIs, Java based, Linux run time
  – Scalability – millions of documents a day
  – Import-Export – XML, RDF
Total Cost of Ownership
Vendor Relationship – OEM
Usability, Multiple Language Support
Team – 3 KAPS – Information; 5-8 Amdocs – SME – business, technical


Text Analytics Evaluation: Case Study
Phase I – Case Study

  – Attensity
  – SAP – Inxight
  – Clarabridge
  – ClearForest
  – Concept Searching
  – Data Harmony / Access Innovations
  – Expert Systems
  – GATE (Open Source)
  – IBM
  – Lexalytics
  – Multi-Tes
  – Nstein
  – SAS
  – SchemaLogic
  – Smart Logic
  – Content Management – Enterprise Search
  – Sentiment Analysis Specialty
  – Ontology Platforms


Text Analytics Evaluation: Case Study
Case Study: Telecom Service

Company History, Reputation
Full Platform – Categorization, Extraction, Sentiment
Integration – java, API-SDK, Linux
Multiple languages
Scale – millions of docs a day
Total Cost of Ownership
Ease of Development - new
Vendor Relationship – OEM, etc.

Expert Systems
IBM
SAS - Teragram
Smart Logic

Option – Multiple vendors – Sentiment & Platform

IBM and SAS – finalists


Text Analytics Evaluation: Case Study
POC Design Discussion: Evaluation Criteria

Basic Test Design – categorize test set
  – Score – by file name, human testers
Categorization – Call Motivation
  – Accuracy Level – 80-90%
  – Effort Level per accuracy level
Sentiment Analysis
  – Accuracy Level – 80-90%
  – Effort Level per accuracy level
Quantify development time – main elements
Comparison of two vendors – how score?
  – Combination of scores and report


Text Analytics Evaluation: Case Study
Phase II – POC: Risks

CIO/CTO Problem – This is not a regular software process
Language is messy not just complex
  – 30% accuracy isn’t 30% done – could be 90%
Variability of human categorization / expression
  – Even professional writers – journalists examples
Categorization is iterative, not “the program works”
  – Need realistic budget and flexible project plan
Anyone can do categorization
  – Librarians often overdo, SMEs often get lost (keywords)
Meta-language issues – understanding the results
  – Need to educate IT and business in their language


Text Analytics POC Outcomes
Categorization Results

SAS IBM

Recall-Motivation 92.6 90.7

Recall-Actions 93.8 88.3

Precision – Mot. 84.3

Precision-Act 100

Uncategorized 87.5

Raw Precision 73 46
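Recall and precision figures like those above are ratio measures over human-judged results. The sketch below shows one common way to compute them (micro-averaged over all documents); the call IDs and labels are made up for illustration and are not the POC data.

```python
def precision_recall(predicted, gold):
    """Micro-averaged precision and recall.

    predicted, gold: dicts mapping a document id to a set of category labels.
    """
    tp = fp = fn = 0
    for doc_id, gold_labels in gold.items():
        pred_labels = predicted.get(doc_id, set())
        tp += len(pred_labels & gold_labels)   # correct assignments
        fp += len(pred_labels - gold_labels)   # assigned but wrong
        fn += len(gold_labels - pred_labels)   # correct label missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical human judgments and system output for three calls.
GOLD = {"call1": {"billing"}, "call2": {"cancel"}, "call3": {"billing", "upgrade"}}
PREDICTED = {"call1": {"billing"}, "call2": {"billing"}, "call3": {"billing"}}
```

Precision asks how many of the assigned categories were right; recall asks how many of the right categories were assigned. The two usually trade off against each other.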


Text Analytics POC Outcomes
Vendor Comparisons

Categorization Results – both good, edge to SAS on precision
  – Use of Relevancy to set thresholds
Development Environment
  – IBM as toolkit provides more flexibility but it also increases development effort
Methodology – IBM enforces good method, but takes more time
  – SAS can be used in exactly the same way
SAS has a much more complete set of operators – NOT, DIST, START


Text Analytics POC Outcomes
Vendor Comparisons - Functionality

Sentiment Analysis – SAS has workbench, IBM would require more development
  – SAS also has statistical modeling capabilities
Entity and Fact extraction – seems basically the same
  – SAS can use operators for improved disambiguation
Summarization – SAS has built-in
  – IBM could develop using categorization rules – but not clear that would be as effective without operators
Conclusion: Both can do the job, edge to SAS
Now the fun begins - development


Text Analytics Development: Foundation

Articulated Information Management Strategy (K Map)
  – Content and Structures and Metadata
  – Search, ECM, applications - and how used in Enterprise
  – Community information needs and Text Analytics Team
POC establishes the preliminary foundation
  – Need to expand and deepen
  – Content – full range, basis for rules-training
  – Additional SMEs – content selection, refinement
Taxonomy – starting point for categorization / suitable?
Databases – starting point for entity catalogs


Text Analytics Development
Enterprise Environment – Case Studies

A Tale of Two Taxonomies
  – It was the best of times, it was the worst of times
Basic Approach
  – Initial meetings – project planning
  – High level K map – content, people, technology
  – Contextual and Information Interviews
  – Content Analysis
  – Draft Taxonomy – validation interviews, refine
  – Integration and Governance Plans


Text Analytics Development
Enterprise Environment – Case One – Taxonomy, 7 facets

Taxonomy of Subjects / Disciplines:
  – Science > Marine Science > Marine microbiology > Marine toxins
Facets:
  – Organization > Division > Group
  – Clients > Federal > EPA
  – Instruments > Environmental Testing > Ocean Analysis > Vehicle
  – Facilities > Division > Location > Building X
  – Methods > Social > Population Study
  – Materials > Compounds > Chemicals
  – Content Type – Knowledge Asset > Proposals
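Paths written with ">" separators, like the taxonomy and facet examples above, map naturally onto a nested structure. The following is a minimal sketch of one possible representation (nested dictionaries); the function name `build_tree` is an assumption for this example, and taxonomy tools use their own storage formats.

```python
def build_tree(paths):
    """Turn 'A > B > C' style paths into a nested dict of child nodes."""
    tree = {}
    for path in paths:
        node = tree
        for label in (part.strip() for part in path.split(">")):
            # Create the child node if it is new, then descend into it.
            node = node.setdefault(label, {})
    return tree


PATHS = [
    "Science > Marine Science > Marine microbiology > Marine toxins",
    "Clients > Federal > EPA",
    "Methods > Social > Population Study",
]
```

Each top-level key then corresponds to one facet or taxonomy root, and shared prefixes are merged into a single branch.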


Text Analytics Development
Enterprise Environment – Case One – Taxonomy, 7 facets

Project Owner – KM department – included RM, business process
Involvement of library - critical
Realistic budget, flexible project plan
Successful interviews – build on context
  – Overall information strategy – where taxonomy fits
Good Draft taxonomy and extended refinement
  – Software, process, team – train library staff
  – Good selection and number of facets
Final plans and hand off to client


Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 facets

Taxonomy of Subjects / Disciplines:
  – Geology > Petrology
Facets:
  – Organization > Division > Group
  – Process > Drill a Well > File Test Plan
  – Assets > Platforms > Platform A
  – Content Type > Communication > Presentations


Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 facets

Environment Issues
  – Value of taxonomy understood, but not the complexity and scope
  – Under budget, under staffed
  – Location – not KM – tied to RM and software
    • Solution looking for the right problem
  – Importance of an internal library staff
  – Difficulty of merging internal expertise and taxonomy


Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 facets

Project Issues
  – Project mind set – not infrastructure
  – Wrong kind of project management
    • Special needs of a taxonomy project
    • Importance of integration – with team, company
  – Project plan more important than results
    • Rushing to meet deadlines doesn’t work with semantics as well as software


Text Analytics Development
Enterprise Environment – Case Two – Taxonomy, 4 facets

Research Issues
  – Not enough research – and wrong people
  – Interference of non-taxonomy – communication
  – Misunderstanding of research – wanted tinker toy connections
    • Interview 1 implies conclusion A
Design Issues
  – Not enough facets
  – Wrong set of facets – business not information
  – Ill-defined facets – too complex internal structure


Text Analytics Development
Conclusion: Risk Factors

Political-Cultural-Semantic Environment
  – Not simple resistance - more subtle
    • Re-interpretation of specific conclusions and sequence of conclusions / Relative importance of specific recommendations
Understanding project scope
Access to content and people
  – Enthusiastic access
Importance of a unified project team
  – Working communication as well as weekly meetings


Text Analytics Development
Case Study 2 – POC – Telecom Client

Demo of SAS - / Enterprise Content Categorization


Text Analytics Development
Best Practices - Principles

Importance of ongoing maintenance and refinement
Need dedicated taxonomy team working with SMEs
Work with application developers to incorporate text analytics into new applications
Importance of metrics and feedback
  – Software and social
Questions:
  – What are important subjects (and changes)?
  – What information do they need?
  – How is their information related to other silos?


Text Analytics Development
Best Practices - Principles

Process
  – Realistic Budget – not a nice to have add on
  – Flexible Project plan - semantics are complex and messy
    • Time estimates are difficult, objective success measures are too
  – Transition from development to maintenance is fluid
Resources
  – Interdisciplinary Team is essential
  – Importance of communication – languages
  – Merging internal and external expertise


Text Analytics Development
Best Practices - Principles

Categorization taxonomy structure
  – Tradeoff of depth and complexity of rules
  – Multiple avenues – facets, terms, rules, etc.
    • No right balance
  – Recall-precision balance is application specific
  – Training sets are starting points, rules rule
  – Need for custom development
Technology
  – Basic integration – XML
  – Advanced – combine unstructured and structured in new ways
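One way to see why the recall-precision balance is application specific is to sweep a relevancy threshold and watch the two measures move in opposite directions. The scores and judgments below are hypothetical, and `sweep_thresholds` is a name chosen for this sketch.

```python
def sweep_thresholds(scored, thresholds):
    """Compute (threshold, precision, recall) for each threshold.

    scored: list of (relevancy score, truly_in_category) pairs for one category.
    A document is assigned to the category when its score >= threshold.
    """
    total_relevant = sum(1 for _, relevant in scored if relevant)
    rows = []
    for threshold in thresholds:
        kept = [relevant for score, relevant in scored if score >= threshold]
        true_positives = sum(kept)
        precision = true_positives / len(kept) if kept else 0.0
        recall = true_positives / total_relevant if total_relevant else 0.0
        rows.append((threshold, precision, recall))
    return rows


# Hypothetical scores for five documents and one category.
SCORED = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
```

Lowering the threshold admits more documents, which tends to raise recall and lower precision. An alerting application might favor recall, while a published navigation facet might favor precision.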


Text Analytics Development
Best Practices – Risk Factors

Value understood, but not the complexity and scope
Project mindset – software project and then done
Not enough research on user information needs, behaviors
  – Talking to the right people and asking the right questions
  – Getting beyond “All of the Above” surveys
Not enough resources, wrong resources
Enthusiastic access to content and people
Bad design – starting with the wrong type of taxonomy
Categorization is not library science
  – More like cognitive anthropology


Semantic Infrastructure Development
Conclusion

Text Analytics is the Foundation for Semantic infrastructure
Evaluation of Text Analytics – different than IT software
  – POC – essential, foundation of development
  – Difference of taxonomy and categorization
    • Concepts vs. text in documents
Enterprise Context – strategic, self-knowledge
  – Infrastructure resource, not a project
  – Interdisciplinary Team and applications
Integration with other initiatives and technologies
  – Text Mining, Data Mining, Sentiment & beyond, Everything!


Questions?

Tom Reamy

[email protected]

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com