www.innodata-isogen.com
Empowering the Publishing Process with Semantic Technologies
Stephen Cohen
Principal Consultant
O’Reilly Tools of Change Conference
23 February 2010
Agenda
• Overview
• Semantic technologies
• Case studies
• Benefits and challenges
• Questions
Innodata Isogen – Who We Are
Innodata Isogen provides knowledge, production, technology and consulting services to the world’s leading media, publishing and information services companies
Locations: New Jersey, Dallas, London, Paris, Israel, Delhi, Colombo, Cebu, Manila
6,500 global staff
We specialize in publishing, helping our clients to:
• lower total cost of ownership for their content supply chain
• re-engineer business processes
• multi-shore services to lower cost, manage risk and balance the cost / quality ratio
• combine content and technology outsourcing to add value
Our clients include:
• leading scholarly, business and legal publishers
• secondary publishers (content aggregators)
• agencies of the U.S. Department of Defense
• major aerospace manufacturers
Overview
• Semantic technologies are often used to more effectively monetize content and improve the customer experience on the Web
– semantic advertising
– semantic search
• They have also been used effectively throughout the publishing process
• Today we will talk about companies that are using semantic technologies and text mining to process content better, faster, cheaper
What Do Publishers Have in Common?
• They all want to deliver information better, faster, cheaper
• Better
– offer the information customers and users want and need (focused)
– make it easier for customers to discover new information and relationships between information
• Faster
– get it in the hands of customers ahead of your competition (when they need it)
• Cheaper
– do it in the most cost effective way possible
Semantic Analysis Tools Can Help
• Across the content supply chain
• Better
– more accurate, consistent content tagging, indexing, abstracting, linking
• Faster
– find out sooner about new information (e.g., announcements, legal opinions, rules changes)
– (semi) automate content enrichment
– increase throughput
• Cheaper
– deploy resources most cost effectively (do more with less)
Semantic Technologies: Some Characteristics
• Briefly, semantic technologies are algorithms that seek to model the associative processes that humans perform to extract meaning from information
• Knowing a little bit about “the man behind the curtain” can help when it comes to deciding which approach is a good fit for your company’s needs
• They can be rules-based, use statistical analysis, use semantic and linguistic clustering, etc.
• Not surprisingly, there are many approaches to modeling and each has its strengths and weaknesses
Rules-Based Text Analysis
• Precisely defines criteria by which a document belongs to a category
• Matches terms in a thesaurus to words in content
• Typically uses “if-then-else” rules
• Relatively easy to deploy; start with simple rules and enhance over time
• Rules can get complex, difficult to maintain
Example rules:
– Word = shrub? Assign Category = ‘bush’
– Word = Bush AND within 4 words of President? Assign Category = ‘chief executive’
– doc.type = email? Assign Category = ‘internal communication’
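The three example rules on this slide can be sketched as a few lines of Python. This is only an illustration, not any vendor's rule engine: the function names (`tokenize`, `within`, `classify`) and the `doc_type` parameter are assumptions made for the example.

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping the original case."""
    return re.findall(r"[A-Za-z']+", text)

def within(tokens, a, b, distance):
    """True if token `a` occurs within `distance` words of token `b`."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def classify(text, doc_type=None):
    """Apply the three if-then rules and return the assigned categories."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    categories = []
    if "shrub" in lowered:                        # Word = shrub?
        categories.append("bush")
    if within(tokens, "Bush", "President", 4):    # Bush within 4 words of President?
        categories.append("chief executive")
    if doc_type == "email":                       # doc.type = email?
        categories.append("internal communication")
    return categories
```

Even in this toy form, each new distinction (case sensitivity, word distance, document type) adds another condition, which is how rule sets grow complex and hard to maintain.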
Statistical Analysis
• Word frequency
• Relative placement of words, groupings
• Distance between words in a document
• Pattern analysis
• Co-occurrence of terms to find clumps or clusters of closely related documents
• Makes assignments to categories based on a set of training documents
• Requires more time to deploy due to need to select a representative set of documents for training the tool
• Accuracy of the semantic analysis will depend on how well the training documents have been chosen
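As one possible illustration of category assignment from training documents, here is a minimal word-frequency classifier (naive Bayes with add-one smoothing). The slides do not name a specific algorithm, so the technique, the tiny `TRAINING` set, and the category names are all assumptions for the sketch.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(training_docs):
    """training_docs: list of (text, category) pairs chosen as the training set."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    doc_counts = Counter()               # category -> number of training documents
    vocabulary = set()
    for text, category in training_docs:
        words = tokenize(text)
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical training set; in practice this must be representative.
TRAINING = [
    ("court ruling appeal judge", "legal"),
    ("judge dismissed the appeal", "legal"),
    ("team won the final game", "sports"),
    ("coach praised the team", "sports"),
]
model = train(TRAINING)
```

Because every score comes from the training word counts, a skewed or unrepresentative training set translates directly into wrong assignments.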
Semantic and Linguistic Clustering
• Concept extraction
• Language dependent
• Documents clustered or grouped depending on meaning of words using thesauri, parts-of-speech analyzers, rule-based & probabilistic grammar, etc.
• Analyzes structure of sentences
– analysis of words: prefixes, suffixes, roots
– word-level analysis including parts of speech
– analyzes structure & relationships between words in a sentence
– possible meanings of a sentence; enhanced by statistical analysis
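The first level above, analysis of words into prefixes, suffixes and roots, can be hinted at with a toy affix splitter. The `PREFIXES` and `SUFFIXES` lists are hypothetical and English-only; real linguistic analyzers rely on language-specific lexicons and grammars, which is why this approach is language dependent.

```python
# Hypothetical, deliberately tiny affix lists for English.
PREFIXES = ("un", "re", "pre")
SUFFIXES = ("ing", "ed", "ly", "s")

def split_word(word):
    """Split a word into (prefix, root, suffix) using longest-match affix lists."""
    word = word.lower()
    # Try longer affixes first; require at least 3 characters to remain.
    prefix = next((p for p in sorted(PREFIXES, key=len, reverse=True)
                   if word.startswith(p) and len(word) - len(p) >= 3), "")
    rest = word[len(prefix):]
    suffix = next((s for s in sorted(SUFFIXES, key=len, reverse=True)
                   if rest.endswith(s) and len(rest) - len(s) >= 3), "")
    root = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, root, suffix
```

For example, `split_word("reprinted")` yields the prefix `re`, the root `print` and the suffix `ed`. A splitter this naive will mis-segment many words, which is one reason the slide notes that such analysis is enhanced by statistical methods.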
The Content Supply Chain
• We view the publishing process in terms of a supply chain
• It begins with content acquisition through conversion and enhancement, on to product assembly and, lastly, to product publishing and distribution
• Using semantic tools has an impact on roles and responsibilities, workflows and the way content is processed at each stage of the content supply chain
• Semantic tools and text mining are used at different stages of the editorial and production process
Supported by: DTDs; content, digital asset repositories; policies; workflow management; metadata; rights management; resources; product definitions; internal and external systems
Source / Create → Convert / Structure → Normalize → Store / Manage → Edit / Enhance → Product Assembly → Publish / Distribute
Semantic Tools in the Content Supply Chain
Stages: Source / Create → Convert / Structure → Normalize → Store / Manage → Edit / Enhance → Product Assembly → Publish / Distribute
• Controlled vocabulary and authority list management; taxonomy managers; knowledge management
• Linking; entity extraction; citations; classification, machine aided indexing; contextual meaning
• Extract content for tagging; identify not only document structure but document meaning; structure unstructured content
• Intelligent agents for targeted retrieval (content federation); “acquire what is new or changed from sites I am interested in”
• Abstracting, auto-summarization (e.g., synopses, headnotes)
• Custom publishing; ‘Synthetic documents’
• Content delivery for multiple output channels and product formats
Case Studies
Preview of Case Studies
• Rules-based auto-classification
• Document analysis and entity linking
• Auto-summarization
• Product assembly
• Custom information feeds
Case Study: Rules-based Auto-classification
Rules-based Auto-classification
• Set-up: DEFINE CLASSIFICATION RULES
– Indexer defines classification rules
– Test & adjust rules
– Baseline Test Set
• RULES BASE
• AUTO-CLASSIFICATION (System)
– Apply rules to classify content against taxonomy
– System tracks rules usage (which ones used; frequency)
• INDEXER REVIEW (Indexer)
– Accepts, rejects, adds classification terms
– Reviews rules the system applied that yielded wrong classification
– Flags problems to rules builder; suggests new terms
– System tracks rules that generated incorrect classifications
• RULES MANAGEMENT
– Review usage statistics: rules used, not used
– Add, modify, delete rules
• TAXONOMY MANAGER (Indexer)
– Add/remove terms; create groupings; map terms
– System: automatic update of rules to reflect changes in taxonomy
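The feedback loop in this workflow (the system tracks which rules fire, the indexer rejects wrong classifications, and the rules builder reviews usage statistics) might be modeled roughly as below. The `RulesBase` class, its single-term rules, and the sample taxonomy categories are hypothetical; the slides do not describe the actual system's data model.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class RulesBase:
    rules: dict                                   # rule name -> (trigger term, taxonomy category)
    used: Counter = field(default_factory=Counter)
    incorrect: Counter = field(default_factory=Counter)

    def classify(self, text):
        """Apply every rule; return {category: rule name} and count rule usage."""
        words = set(text.lower().split())
        assigned = {}
        for name, (term, category) in self.rules.items():
            if term in words:
                assigned[category] = name
                self.used[name] += 1
        return assigned

    def review(self, assigned, rejected_categories):
        """Record the indexer's rejections against the rules that fired."""
        for category in rejected_categories:
            if category in assigned:
                self.incorrect[assigned[category]] += 1

    def unused_rules(self):
        """Rules that never fired: candidates for modification or deletion."""
        return [name for name in self.rules if self.used[name] == 0]
```

A rules builder could then prioritize rules with a high `incorrect` count for revision and rules listed by `unused_rules()` for removal.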
Case Study: Document Analysis and Entity Linking
Document Analysis and Entity Linking
• Focus is on document analysis and entity linking in editorial workflow
• Subsidiary of a global legal publishing house
– content base of 3.5 million cases, related documents
– manages over 17 million citations
– updates of case law processed daily
– cases growing at 20% per annum
• Challenges
– avoid processes performed manually by individuals
– allow the user to select and filter the information needed for their job
– take into account an increasing number of legal information sources
• This case study describes the target configuration, which is not yet fully realized
Goals for the New Process
• Aid the process of knowledge extraction and storage
– identify legal sources (e.g., official publication, case law decision)
– extract legal citations (which source is cited and why?)
– populate a knowledge base and cyclically enrich the content
• Process each piece of information one time
– normalize, tag, enrich, link, form concepts, etc.
• Build a standardized common knowledge base for use throughout the editorial and production process and downstream by end users
• Maintain consistent thesauri, ontologies, taxonomies and provide a mechanism for their management and updating
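The citation-extraction and knowledge-base goals can be illustrated with a small sketch. The citation shape used here ("volume reporter page", as in U.S. case reporters) is an assumption chosen for the example; the publisher's actual citation formats and knowledge-base schema are not given in the slides.

```python
import re
from collections import defaultdict

# Assumed citation shape: "<volume> <reporter> <page>", e.g. "347 U.S. 483".
CITATION = re.compile(r"\b(\d{1,4}) (U\.S\.|F\.2d|F\.3d) (\d{1,4})\b")

def extract_citations(text):
    """Return normalized citation strings found in the text, in order."""
    return [" ".join(m.groups()) for m in CITATION.finditer(text)]

def update_knowledge_base(kb, doc_id, text):
    """Record which documents cite each source (cited source -> citing documents)."""
    for citation in extract_citations(text):
        kb[citation].add(doc_id)
    return kb
```

Keying the knowledge base on the normalized citation means each document only needs to be analyzed once, while the set of citing documents for a source keeps growing as new cases arrive.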
Document Analysis and Linking Process
• DEFINITION PHASE
– Domain-specific lists for entity recognition
– Text mining rules
– Baseline Test Set
– Test text analysis tool
• AUTOMATED TEXT ANALYSIS (Automated Semantic Analysis)
– Entity extraction
– Iterative application of rules
– Tag content
– Linking
• REVIEW AND QC ENRICHED CONTENT (Legal editors)
– Use search, navigation tools to review, identify, and correct: entity error, link error, concept error
– SEARCH AND NAVIGATION SERVICES
• LIST AND RULES MAINTENANCE (Librarian)
– Weekly review of exception reports
– KNOWLEDGE MANAGEMENT
Benefits of the New Process
• Workflow
– a semi-automated process
– editors review QC output from text mining tool to enhance and correct as necessary
– analysis and linking by automated text analysis tool
– parallel processing in text analysis tool
– analysis, referencing and linking became part of the same workflow
• Roles and responsibilities
– editors no longer need to be experts in mark-up languages; content is tagged automatically
– low value editorial tasks handled by text analysis tool
– existing staff can focus on high value tasks
– new role to maintain and enhance semantic lists and text mining tool rules
• Content
– quality of document analysis improves through enhancements to the lists and rules used by the text mining tool
– able to federate metadata across multiple content management systems
– same knowledge base and text mining tool integrated into online products
Case Study: Auto-summarization
Auto-summarization – Major Newspaper
• Content in
• Document Analysis: source; type; format; content
• Rules Base: extent of automation depends on article importance
– Manual summarization (outsource or in-house experts), OR
– Auto-summarization (draft version), then expert review and edit (final version), OR
– Auto-summarization
• Auto-summarization uses:
– Document zones
– Rules: semantics; dictionary; complex grammar rules
– Section weightings
– Sentence position
– Relative importance of sentences
– Markers for start of sections, paragraphs, sentences
– Sentence length of summary
• Administrator monitors, improves rules set based on usage
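A few of the inputs listed on this slide (document zones, section weightings, sentence position, summary length) are enough for a minimal extractive summarizer. The scoring formula below (zone weight divided by sentence position) and the zone names are assumptions for illustration; the newspaper's actual rules are not described in the slides.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(zones, zone_weights, max_sentences=2):
    """Score sentences by zone weight and position; return the top ones in document order.

    zones: list of (zone_name, text) pairs in document order.
    zone_weights: zone_name -> weight (hypothetical section weightings).
    """
    scored = []
    order = 0
    for zone_name, text in zones:
        weight = zone_weights.get(zone_name, 1.0)
        for position, sentence in enumerate(split_sentences(text)):
            # Earlier sentences within a zone score higher (sentence position).
            score = weight / (position + 1)
            scored.append((score, order, sentence))
            order += 1
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

Output like this would correspond to the draft version; for important articles the workflow still routes it to an expert for review and edit.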
Case Study: Product Assembly
Product Assembly
• New Content Process: Source / Capture → Convert / Normalize → Analyze / Classify / Enhance (Editorial)
• Content Repository: XML Content Store; Rich Media
• Extract Product Content From Repository: four parallel paths, each Select content (XQuery) → Render
• Format Product: rendering technologies shown include FOSI; XSLFO; Proprietary; XSLT; CSS; RSS; WCSS
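The select-then-render pattern in this diagram can be sketched with Python's standard library. This is a stand-in only: the slide names XQuery for selection and FOSI, XSL-FO, XSLT and similar technologies for rendering, whereas the sketch uses ElementTree's limited XPath support and hand-written renderers. The repository markup and the `product` attribute are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical repository content; the product attribute is an assumption.
REPOSITORY = """<repository>
  <article product="newsletter"><title>Rates rise</title><body>Details here.</body></article>
  <article product="handbook"><title>Filing rules</title><body>Step by step.</body></article>
  <article product="newsletter"><title>New ruling</title><body>Summary text.</body></article>
</repository>"""

def select_content(root, product):
    """Select the articles tagged for one product (stand-in for an XQuery selection)."""
    return root.findall(f"./article[@product='{product}']")

def render_html(articles):
    """Render the selection as simple HTML for a web channel."""
    return "\n".join(
        f"<h2>{escape(a.findtext('title'))}</h2><p>{escape(a.findtext('body'))}</p>"
        for a in articles)

def render_rss(articles):
    """Render the same selection as RSS-style items for a feed channel."""
    channel = ET.Element("channel")
    for a in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = a.findtext("title")
        ET.SubElement(item, "description").text = a.findtext("body")
    return ET.tostring(channel, encoding="unicode")

root = ET.fromstring(REPOSITORY)
newsletter = select_content(root, "newsletter")
```

The point of the design is that one selection from the repository feeds several renderers, so adding an output channel means adding a renderer rather than re-editing content.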
Case Study: Custom Information Feeds
Custom Information Feeds
• Content: BASEBALL, FOOTBALL, HOCKEY, SOCCER; PRO, REC, COLLEGE, HIGH SCHOOL; SCORES, PLAYERS, RULES, STATS, SCHEDS
• Repository (XML): NEWS, STANDINGS, RICH MEDIA, PEOPLE
• Delivery to End Users: REAL-TIME UPDATES, TARGETED INFO, REAL-TIME FEEDS, ENRICHED EMAIL
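Once content carries tags like those in the diagram, targeted delivery reduces to matching item tags against each user's interests. The sketch below assumes a simple model in which a subscription lists tags that must all be present on an item; the `Item` and `Subscription` types and the sample data are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    headline: str
    tags: frozenset            # e.g. {"HOCKEY", "COLLEGE", "SCORES"}

@dataclass(frozen=True)
class Subscription:
    user: str
    required_tags: frozenset   # every tag must be present on a delivered item

def build_feeds(items, subscriptions):
    """Return {user: [headlines]} with only the items matching each subscription."""
    feeds = {}
    for sub in subscriptions:
        feeds[sub.user] = [item.headline for item in items
                           if sub.required_tags <= item.tags]
    return feeds
```

The quality of such feeds depends entirely on consistent tagging upstream, which is where the semantic classification earlier in the supply chain pays off.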
Benefits of Using Semantic Technologies
• People
– minimize high-value resources performing commodity tasks
– editors focus on real editorial added value; no need to be concerned about markup
– increased capacity without increasing headcount
– novice indexers come up to speed quicker
• Process
– reduced processing time due to automation
– sequential tasks can be performed in one step
– products can be more targeted to specific customer needs
– parts can be outsourced
• Content
– richer, more consistent classification, linking, summarization, semantic tagging
– common controlled vocabularies maintained and applied across entire content base
– same content can be classified and summarized along more dimensions to serve different customer groups
– greater value can be extracted from unstructured content with text mining and semantic analysis
– taxonomy managers support a rigorous approach to maintenance and updating
Challenges Using Semantic Technologies
• People
– retraining resources for new roles (rules builder, taxonomy manager, etc.) is time consuming
– level of accuracy depends on ability of editors to write logical rules
• Process
– time required to refine rules and train analysis engine can be extensive (some report 12-18 months)
– productivity improvements are a function of thesaurus structure, rule-builder’s skill level, document type; the more complex any of these are, the longer it takes to achieve return on investment
• Content
– automated content analysis doesn’t match up to the analytical skills of trained subject area experts (at least in some highly technical disciplines)
– some find it difficult to measure the impact of indexing consistency
– lower quality when there is fully automated machine aided indexing with no follow-on QC by subject area experts
Questions
Stephen Cohen
Principal Consultant
[email protected]
+1 (201) 371-8044
THANK YOU
Innodata Isogen, Inc.
Three University Plaza
Hackensack, NJ 07601
+1 (201) 371-2828
www.innodata-isogen.com