www.innodata-isogen.com

Empowering the Publishing Process with Semantic Technologies

Stephen Cohen, Principal Consultant

O’Reilly Tools of Change Conference, 23 February 2010

Agenda

• Overview
• Semantic technologies
• Case studies
• Benefits and challenges
• Questions

Innodata Isogen – Who We Are

Innodata Isogen provides knowledge, production, technology and consulting services to the world’s leading media, publishing and information services companies.

• 6,500 global staff
• Locations: New Jersey; Dallas; London; Paris; Israel; Delhi; Colombo; Cebu; Manila

We specialize in publishing, helping our clients to:
• lower total cost of ownership for their content supply chain
• re-engineer business processes
• multi-shore services to lower cost, manage risk and balance the cost / quality ratio
• combine content and technology outsourcing to add value

Our clients include:
• leading scholarly, business and legal publishers
• secondary publishers (content aggregators)
• agencies of the U.S. Department of Defense
• major aerospace manufacturers

Overview

• Semantic technologies are often used to more effectively monetize content and improve the customer experience on the Web
– semantic advertising
– semantic search

• They have also been used effectively throughout the publishing process

• Today we will talk about companies that are using semantic technologies and text mining to process content better, faster, cheaper

What Do Publishers Have in Common?

• They all want to deliver information better, faster, cheaper
• Better
– offer the information customers and users want and need (focused)
– make it easier for customers to discover new information and relationships between information
• Faster
– get it into the hands of customers ahead of your competition (when they need it)
• Cheaper
– do it in the most cost-effective way possible

Semantic Analysis Tools Can Help

• Across the content supply chain
• Better
– more accurate, consistent content tagging, indexing, abstracting, linking
• Faster
– find out sooner about new information (e.g., announcements, legal opinions, rules changes)
– (semi) automate content enrichment
– increase throughput
• Cheaper
– deploy resources most cost effectively (do more with less)

Semantic Technologies: Some Characteristics

• Briefly, semantic technologies are algorithms that seek to model the associative processes that humans perform to extract meaning from information

• Knowing a little bit about “the man behind the curtain” can help when it comes to deciding which approach is a good fit for your company’s needs

• They can be rules-based, use statistical analysis, use semantic and linguistic clustering, etc.

• Not surprisingly, there are many approaches to modeling and each has its strengths and weaknesses

Rules-Based Text Analysis

• Precisely defines criteria by which a document belongs to a category
• Matches terms in a thesaurus to words in content
• Typically uses “if-then-else” rules
• Relatively easy to deploy; start with simple rules and enhance over time
• Rules can get complex and difficult to maintain

Example rules:
• Word = shrub? Assign Category = ‘bush’
• Word = Bush AND within 4 words of President? Assign Category = ‘chief executive’
• doc.type = email? Assign Category = ‘internal communication’
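The three example rules can be sketched as a small "if-then" rule table. This is a minimal illustration, not the behavior of any particular product: the `Document` and `Rule` classes, the `doc_type` field and the helper names are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    text: str
    doc_type: str = "article"  # hypothetical metadata field, e.g. "email"


@dataclass
class Rule:
    name: str
    condition: Callable[[Document], bool]  # the "if" part
    category: str                          # the "then" part


def words(text: str) -> List[str]:
    """Split text into word tokens, keeping the original case."""
    return re.findall(r"[A-Za-z']+", text)


def has_word(doc: Document, word: str) -> bool:
    """Case-insensitive whole-word match."""
    return word.lower() in (w.lower() for w in words(doc.text))


def near(doc: Document, first: str, second: str, distance: int) -> bool:
    """True if `first` occurs within `distance` words of `second` (case-sensitive)."""
    tokens = words(doc.text)
    pos_first = [i for i, w in enumerate(tokens) if w == first]
    pos_second = [i for i, w in enumerate(tokens) if w == second]
    return any(abs(i - j) <= distance for i in pos_first for j in pos_second)


# The three rules from the slide, expressed as data.
RULES = [
    Rule("shrub", lambda d: has_word(d, "shrub"), "bush"),
    Rule("president-bush", lambda d: near(d, "Bush", "President", 4), "chief executive"),
    Rule("email", lambda d: d.doc_type == "email", "internal communication"),
]


def classify(doc: Document, rules: List[Rule] = RULES) -> List[str]:
    """Return the category of every rule whose condition matches, in rule order."""
    return [rule.category for rule in rules if rule.condition(doc)]


speech = Document("President George Bush spoke today.")
memo = Document("Prune the shrub in spring.", doc_type="email")
```

Keeping rules as data rather than nested conditionals is one way to start simple and add rules over time, although the maintenance burden noted above still grows with the rule count.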

Statistical Analysis

• Word frequency
• Relative placement of words, groupings
• Distance between words in a document
• Pattern analysis
• Co-occurrence of terms to find clumps or clusters of closely related documents
• Makes assignments to categories based on a set of training documents
• Requires more time to deploy due to the need to select a representative set of documents for training the tool
• Accuracy of the semantic analysis will depend on how well the training documents have been chosen
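One simple way to picture category assignment from training documents is a word-frequency profile per category, with new text assigned to the most similar profile. The sketch below uses cosine similarity over raw word counts; that choice, the tiny training set and all function names are assumptions for illustration, and commercial tools use considerably richer statistics.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def term_frequencies(text: str) -> Counter:
    """Count lowercase word tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(training_docs: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Build one summed word-frequency profile per category."""
    profiles = {}
    for category, docs in training_docs.items():
        profile = Counter()
        for doc in docs:
            profile.update(term_frequencies(doc))
        profiles[category] = profile
    return profiles


def classify(text: str, profiles: Dict[str, Counter]) -> str:
    """Assign the category whose training profile is most similar to the text."""
    vector = term_frequencies(text)
    return max(profiles, key=lambda category: cosine(vector, profiles[category]))


# Hypothetical training set; accuracy depends on how representative it is.
TRAINING = {
    "legal": ["The court issued an opinion on the appeal.",
              "The judge ruled that the statute applies to the case."],
    "sports": ["The team won the game in the final inning.",
               "The pitcher threw a no-hitter and the team won."],
}
profiles = train(TRAINING)
```

With only two training documents per category the profiles are dominated by common words, which illustrates why selecting a representative training set takes time.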

Semantic and Linguistic Clustering

• Concept extraction
• Language dependent
• Documents clustered or grouped depending on meaning of words using thesauri, parts-of-speech analyzers, rule-based & probabilistic grammar, etc.
• Analyzes structure of sentences
– analysis of words: prefixes, suffixes, roots
– word-level analysis including parts of speech
– analyzes structure & relationships between words in a sentence
– possible meanings of a sentence; enhanced by statistical analysis
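A heavily simplified sketch of grouping documents by meaning: words are reduced to a root, mapped to a concept through a thesaurus, and documents sharing a concept are grouped. The thesaurus entries, the plural-stripping logic (English only, which is one reason such tools are language dependent) and the function names are assumptions; part-of-speech and grammar analysis are left out entirely.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical thesaurus: surface term -> preferred concept.
THESAURUS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "physician": "doctor", "doctor": "doctor", "surgeon": "doctor",
}

SUFFIXES = ("es", "s")  # deliberately naive English plural stripping


def root(word: str) -> str:
    """Strip a plural suffix if the result is a known thesaurus term."""
    for suffix in SUFFIXES:
        candidate = word[: -len(suffix)]
        if word.endswith(suffix) and candidate in THESAURUS:
            return candidate
    return word


def concepts(text: str) -> Set[str]:
    """Extract the set of thesaurus concepts mentioned in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {THESAURUS[root(t)] for t in tokens if root(t) in THESAURUS}


def group_by_concept(docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group document ids under every concept they mention."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for concept in sorted(concepts(text)):
            groups[concept].append(doc_id)
    return dict(groups)


DOCS = {
    "d1": "Two automobiles collided with a truck.",
    "d2": "The surgeon drove her car.",
    "d3": "Physicians met on Monday.",
}
groups = group_by_concept(DOCS)
```

Here `d1` and `d2` end up together under "vehicle" even though they share no words, which is the point of grouping on meaning rather than on surface terms.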

The Content Supply Chain

• We view the publishing process in terms of a supply chain
• It begins with content acquisition, moves through conversion and enhancement, on to product assembly and, lastly, to product publishing and distribution

• Using semantic tools has an impact on roles and responsibilities, workflows and the way content is processed at each stage of the content supply chain

• Semantic tools and text mining are used at different stages of the editorial and production process

Supported by: DTDs; content, digital asset repositories; policies; workflow management; metadata; rights management; resources; product definitions; internal and external systems

Stages: Source / Create; Convert / Structure; Normalize; Store / Manage; Edit / Enhance; Product Assembly; Publish / Distribute

Semantic Tools in the Content Supply Chain

• Controlled vocabulary and authority list management; taxonomy managers; knowledge management
• Linking; entity extraction; citations; classification, machine aided indexing; contextual meaning
• Extract content for tagging; identify not only document structure but document meaning; structure unstructured content
• Intelligent agents for targeted retrieval (content federation); “acquire what is new or changed from sites I am interested in”
• Abstracting, auto-summarization (e.g., synopses, headnotes)
• Custom publishing; ‘Synthetic documents’
• Content delivery for multiple output channels and product formats

Stages: Source / Create; Convert / Structure; Normalize; Store / Manage; Edit / Enhance; Product Assembly; Publish / Distribute

Case Studies

Preview of Case Studies

• Rules-based auto-classification
• Document analysis and entity linking
• Auto-summarization
• Product assembly
• Custom information feeds

Case Study: Rules-based Auto-classification

Rules-based Auto-classification

SET-UP
• TAXONOMY MANAGER
– Indexer: add/remove terms; create groupings; map terms
– System: automatic update of rules to reflect changes in taxonomy
• DEFINE CLASSIFICATION RULES
– Indexer defines classification rules
– Baseline test set
– Test & adjust rules
• RULES MANAGEMENT
– Review usage statistics: rules used, not used; add, modify, delete rules
• RULES BASE

AUTO-CLASSIFICATION
• Apply rules to classify content against taxonomy
• System tracks rules usage (which ones used; frequency)

INDEXER REVIEW
• Indexer
– accepts, rejects, adds classification terms
– reviews rules the system applied that yielded wrong classification
– flags problems to rules builder; suggests new terms
• System
– tracks rules that generated incorrect classifications
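The feedback loop in this workflow (the system tracks which rules fire and which ones indexers reject) can be sketched as below. The rule format, the substring matching and the class and method names are assumptions made for illustration, not a description of the system used in the case study.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class RulesBase:
    # Hypothetical rule format: rule id -> (trigger term, taxonomy term to assign)
    rules: Dict[str, Tuple[str, str]]
    usage: Counter = field(default_factory=Counter)       # how often each rule fired
    rejections: Counter = field(default_factory=Counter)  # how often an indexer rejected it

    def classify(self, text: str) -> Dict[str, str]:
        """Apply every rule; return {rule id: assigned term} and record usage."""
        lowered = text.lower()
        assigned = {}
        for rule_id, (trigger, term) in self.rules.items():
            if trigger in lowered:  # naive substring match, for brevity
                assigned[rule_id] = term
                self.usage[rule_id] += 1
        return assigned

    def review(self, assigned: Dict[str, str], rejected_terms: Set[str]) -> List[str]:
        """Indexer review: keep accepted terms, log the rules behind rejected ones."""
        accepted = []
        for rule_id, term in assigned.items():
            if term in rejected_terms:
                self.rejections[rule_id] += 1
            else:
                accepted.append(term)
        return accepted

    def unused_rules(self) -> List[str]:
        """Rules that never fired: candidates for review in rules management."""
        return [rule_id for rule_id in self.rules if self.usage[rule_id] == 0]


base = RulesBase({
    "r1": ("merger", "Mergers & Acquisitions"),
    "r2": ("bank", "Banking"),
    "r3": ("patent", "Intellectual Property"),
})
assigned = base.classify("The merger created a new company on the river bank.")
accepted = base.review(assigned, rejected_terms={"Banking"})
```

In this run rule `r2` fires on "river bank" and is rejected by the indexer, so it shows up in `base.rejections` for the rules builder, while `r3` is reported as unused.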

Case Study: Document Analysis and Entity Linking

Document Analysis and Entity Linking

• Focus is on document analysis and entity linking in editorial workflow

• Subsidiary of a global legal publishing house
– content base of 3.5 million cases and related documents
– manages over 17 million citations
– updates of case law processed daily
– cases growing at 20% per annum
• Challenges
– avoid processes performed manually by individuals
– allow users to select and filter the information needed for their job
– take into account an increasing number of legal information sources
• This describes the target configuration, which is not yet fully realized

Goals for the New Process

• Aid the process of knowledge extraction and storage
– identify legal sources (e.g., official publication, case law decision)
– extract legal citations (which source is cited and why?)
– populate a knowledge base and cyclically enrich the content
• Process each piece of information one time
– normalize, tag, enrich, link, form concepts, etc.
• Build a standardized common knowledge base for use throughout the editorial and production process and downstream by end users

• Maintain consistent thesauri, ontologies, taxonomies and provide a mechanism for their management and updating
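The "extract citations, normalize, populate a knowledge base" goals can be sketched with a regular expression and a dictionary. The citation shape below (volume, reporter abbreviation, page) is an assumed, simplified pattern chosen for the example; the slides do not say which jurisdiction or citation formats the publisher handles, and the sample citations are made up.

```python
import re
from collections import defaultdict
from typing import DefaultDict, List

# Assumed, simplified citation shape: <volume> <reporter> <page>, e.g. "123 F.3d 456".
CITATION = re.compile(r"\b(\d{1,4})\s+(U\.S\.|F\.[23]d|S\. ?Ct\.)\s+(\d{1,5})\b")


def normalize(volume: str, reporter: str, page: str) -> str:
    """Normalize a citation to one canonical key so variants link to the same source."""
    return f"{int(volume)} {reporter.replace(' ', '')} {int(page)}"


def extract_citations(text: str) -> List[str]:
    """Return every normalized citation found in the text, in order of appearance."""
    return [normalize(*m.groups()) for m in CITATION.finditer(text)]


def populate(knowledge_base: DefaultDict[str, List[str]], doc_id: str, text: str) -> None:
    """Record, for each cited source, which documents cite it (each document once)."""
    for cited in extract_citations(text):
        if doc_id not in knowledge_base[cited]:
            knowledge_base[cited].append(doc_id)


kb: DefaultDict[str, List[str]] = defaultdict(list)
populate(kb, "case-A", "See 100 U.S. 200; accord 100 U.S. 200 and 123 F.3d 456.")
populate(kb, "case-B", "Compare 123 F.3d 456 with 100 S. Ct. 200.")
```

Normalizing before storing is what lets "100 S. Ct. 200" and "100 S.Ct. 200" resolve to a single knowledge-base entry, so each piece of information is processed once and reused downstream.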

Document Analysis and Linking Process

Legal editors
• SEARCH AND NAVIGATION SERVICES
• REVIEW AND QC ENRICHED CONTENT
– use search, navigation tools to review, identify, and correct
– entity error; link error; concept error

Librarian
• LIST AND RULES MAINTENANCE
– weekly review of exception reports
• KNOWLEDGE MANAGEMENT

Automated Semantic Analysis
• DEFINITION PHASE
– domain-specific lists for entity recognition
– text mining rules
– baseline test set
– test text analysis tool
• AUTOMATED TEXT ANALYSIS
– entity extraction
– iterative application of rules
– tag content
– linking
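The interplay between the domain-specific lists, automated tagging and the exception report can be sketched as follows. The entity list, the candidate pattern, the identifiers and the `<entity>` markup are all hypothetical, chosen only to show how unknown candidates might be routed to the librarian instead of being tagged.

```python
import re
from typing import List, Tuple

# Hypothetical domain-specific list: surface form -> knowledge-base identifier.
ENTITY_LIST = {
    "supreme court": "court:supreme",
    "civil code": "statute:civil-code",
}

# Hypothetical text mining rule: a capitalized word followed by "Court" or "Code"
# is a candidate entity.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+ (?:Court|Code)\b")


def analyze(text: str) -> Tuple[str, List[str]]:
    """Tag and link known entities; collect unknown candidates for the exception report."""
    exceptions: List[str] = []

    def tag(match: "re.Match[str]") -> str:
        surface = match.group(0)
        entity_id = ENTITY_LIST.get(surface.lower())
        if entity_id is None:
            exceptions.append(surface)  # left untagged, reported for list maintenance
            return surface
        return f'<entity ref="{entity_id}">{surface}</entity>'

    return CANDIDATE.sub(tag, text), exceptions


tagged, exception_report = analyze(
    "The Supreme Court applied the Civil Code, reversing the Commercial Court."
)
```

Each time the librarian adds a reported term such as "Commercial Court" to the list, the next pass tags it automatically, which is how quality improves through list and rule maintenance rather than through manual markup.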

Benefits of the New Process

• Workflow
– a semi-automated process
– editors review QC output from text mining tool to enhance and correct as necessary
– analysis and linking by automated text analysis tool
– parallel processing in text analysis tool
– analysis, referencing and linking became part of the same workflow
• Roles and responsibilities
– editors no longer need to be experts in mark-up languages; content is tagged automatically
– low value editorial tasks handled by text analysis tool
– existing staff can focus on high value tasks
– new role to maintain and enhance semantic lists and text mining tool rules
• Content
– quality of document analysis improves through enhancements to the lists and rules used by the text mining tool
– able to federate metadata across multiple content management systems
– same knowledge base and text mining tool integrated into online products

Case Study: Auto-summarization

Auto-summarization – Major Newspaper

• Content in
• Document Analysis: source; type; format; content
• Rules Base: extent of automation depends on article importance
• Manual summarization (outsource or in-house experts) OR auto-summarization (draft version) followed by expert review and edit (final version) OR auto-summarization
• Rules set:
– document zones
– rules: semantics; dictionary; complex grammar rules
– section weightings
– sentence position
– relative importance of sentences
– markers for start of sections, paragraphs, sentences
– sentence length of summary
• Administrator monitors, improves rules set based on usage
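A few of the listed factors (sentence markers, sentence position, relative importance of sentences, length of the summary) can be combined in a small extractive summarizer. This is a sketch under simple assumptions: importance is approximated by average word frequency, the opening sentence gets a fixed extra weight, and the stopword list and parameter names are invented for the example. It does not model document zones, dictionaries or grammar rules.

```python
import re
from collections import Counter
from typing import List

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for", "is", "was", "that", "it"}


def split_sentences(text: str) -> List[str]:
    """Naive sentence marker: '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def content_words(sentence: str) -> List[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def summarize(text: str, max_sentences: int = 2, lead_weight: float = 2.0) -> str:
    """Pick the highest-scoring sentences and return them in their original order."""
    sentences = split_sentences(text)
    frequency = Counter(w for s in sentences for w in content_words(s))

    def score(index: int) -> float:
        words = content_words(sentences[index])
        if not words:
            return 0.0
        importance = sum(frequency[w] for w in words) / len(words)
        # Sentence position: weight the opening sentence more heavily.
        return importance * (lead_weight if index == 0 else 1.0)

    ranked = sorted(range(len(sentences)), key=score, reverse=True)[:max_sentences]
    return " ".join(sentences[i] for i in sorted(ranked))


ARTICLE = (
    "The city council approved the new transit budget on Tuesday. "
    "The budget adds funding for bus routes. "
    "Residents spoke at the meeting. "
    "The council will review the transit budget again next year."
)
draft = summarize(ARTICLE)
```

The output is best treated as the draft version in the workflow above; `max_sentences` and `lead_weight` are the kind of settings an administrator would tune based on usage.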

Case Study: Product Assembly

Product Assembly

• New Content Process
– Source / Capture
– Convert / Normalize
– Analyze / Classify / Enhance - Editorial
• Content Repository
– XML Content Store
– Rich Media
• Extract Product Content From Repository
– Select content (XQuery)
• Format Product
– Render: FOSI; XSLFO; Proprietary
– Render: XSLT; CSS; RSS
– Render: WCSS; Proprietary
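The "select once, render many ways" idea can be sketched with Python's standard library. Note the substitutions: ElementTree's limited XPath support stands in for XQuery, and two small Python functions stand in for the XSLT, XSL-FO or FOSI rendering steps. The repository structure, element names and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET
from html import escape
from typing import List

# Hypothetical repository content; element and attribute names are illustrative.
REPOSITORY_XML = """
<repository>
  <article id="a1" topic="tax"><title>New Filing Rules</title><body>Deadlines change in April.</body></article>
  <article id="a2" topic="labor"><title>Overtime Update</title><body>Thresholds were revised.</body></article>
  <article id="a3" topic="tax"><title>Credits &amp; Deductions</title><body>Two credits were added.</body></article>
</repository>
"""


def select_content(root: ET.Element, topic: str) -> List[ET.Element]:
    """Select articles for a product; ElementTree's XPath subset stands in for XQuery."""
    return root.findall(f".//article[@topic='{topic}']")


def render_html(articles: List[ET.Element]) -> str:
    """Render the selection for a web channel."""
    items = [
        f"<section><h2>{escape(a.findtext('title'))}</h2><p>{escape(a.findtext('body'))}</p></section>"
        for a in articles
    ]
    return "\n".join(items)


def render_text(articles: List[ET.Element]) -> str:
    """Render the same selection for a plain-text channel."""
    return "\n\n".join(f"{a.findtext('title').upper()}\n{a.findtext('body')}" for a in articles)


root = ET.fromstring(REPOSITORY_XML)
tax_product = select_content(root, "tax")
web_output = render_html(tax_product)
text_output = render_text(tax_product)
```

Because selection depends on the `topic` attribute, the quality of the classification applied during the editorial stage directly determines what ends up in each product.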

Case Study: Custom Information Feeds

Custom Information Feeds

End Users / Content
• BASEBALL; FOOTBALL; HOCKEY; SOCCER
• PRO; REC; COLLEGE; HIGH SCHOOL
• SCORES; PLAYERS; RULES; STATS; SCHEDS

Repository
• NEWS; STANDINGS; RICH MEDIA; PEOPLE; XML

Delivery
• REAL-TIME UPDATES; TARGETED INFO; REAL-TIME FEEDS; ENRICHED EMAIL
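Targeted delivery of this kind amounts to matching tagged content against user profiles. The sketch below assumes each content item carries three tags (sport, level, kind of information) and each user subscribes to sets of those values; the class names, field names and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ContentItem:
    title: str
    sport: str  # e.g. "baseball"
    level: str  # e.g. "college"
    kind: str   # e.g. "scores"


@dataclass
class Subscription:
    user: str
    sports: Set[str]
    levels: Set[str]
    kinds: Set[str]
    channel: str = "real-time feed"  # or "enriched email"

    def matches(self, item: ContentItem) -> bool:
        """An item is relevant only if all three of its tags are subscribed to."""
        return (item.sport in self.sports
                and item.level in self.levels
                and item.kind in self.kinds)


def route(items: List[ContentItem], subscriptions: List[Subscription]) -> Dict[str, List[str]]:
    """Build each user's feed: titles of matching items, labeled with the delivery channel."""
    feeds: Dict[str, List[str]] = {sub.user: [] for sub in subscriptions}
    for item in items:
        for sub in subscriptions:
            if sub.matches(item):
                feeds[sub.user].append(f"[{sub.channel}] {item.title}")
    return feeds


ITEMS = [
    ContentItem("Tigers win 5-3", "baseball", "college", "scores"),
    ContentItem("League schedule released", "hockey", "pro", "scheds"),
    ContentItem("State final set", "soccer", "high school", "scheds"),
]
SUBSCRIPTIONS = [
    Subscription("ana", {"baseball", "soccer"}, {"college", "high school"}, {"scores", "scheds"}),
    Subscription("ben", {"hockey"}, {"pro"}, {"scheds", "stats"}, channel="enriched email"),
]
FEEDS = route(ITEMS, SUBSCRIPTIONS)
```

The matching itself is trivial; the value comes from content being tagged consistently enough upstream that a three-way filter returns the right items.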

Benefits of Using Semantic Technologies

• People
– minimize high-value resources performing commodity tasks
– editors focus on real editorial added value; no need to be concerned about markup
– increased capacity without increasing headcount
– novice indexers come up to speed quicker
• Process
– reduced processing time due to automation
– sequential tasks can be performed in one step
– products can be more targeted to specific customer needs
– parts can be outsourced
• Content
– richer, more consistent classification, linking, summarization, semantic tagging
– common controlled vocabularies maintained and applied across entire content base
– same content can be classified and summarized along more dimensions to serve different customer groups
– greater value can be extracted from unstructured content with text mining and semantic analysis
– taxonomy managers support a rigorous approach to maintenance and updating

Challenges Using Semantic Technologies

• People
– retraining resources for new roles (rules builder, taxonomy manager, etc.) is time consuming
– level of accuracy depends on ability of editors to write logical rules
• Process
– time required to refine rules and train analysis engine can be extensive (some report 12-18 months)
– productivity improvements are a function of thesaurus structure, rule-builder’s skill level and document type; the more complex any of these are, the longer it takes to achieve return on investment
• Content
– automated content analysis doesn’t match up to the analytical skills of trained subject area experts (at least in some highly technical disciplines)
– some find it difficult to measure the impact of indexing consistency
– lower quality when there is fully automated machine aided indexing with no follow-on QC by subject area experts

Questions

THANK YOU

Stephen Cohen, Principal Consultant
[email protected]
+1 (201) 371-8044

Innodata Isogen, Inc.
Three University Plaza
Hackensack, NJ 07601
+1 (201) 371-2828
www.innodata-isogen.com

Proprietary and Confidential
