TRANSCRIPT
THOMSON FINANCIAL
Ian Koenig
Chief Architect – Thomson Financial
C4. Case Study: Event Processing as a Core Capability of Your Content Distribution Fabric
20 September 2007
Copyright © Thomson Financial
Agenda
1. A framework for discussing Complex Event Processing
2. “Elementized” News as a data source
3. The “fabric” for distributing content and its emerging capabilities for stream processing
4. Enabling new types of data sources creating new ‘opportunities’
5. Hinting at a larger pattern for distributing content and the role that complex event processing will play.
The CEP Framework model
[Diagram labels: Content Sources (News, Level 1, Level 2, Other); Content Streams; Stream Adapters; CEP Engine (Event Processors, Application Logic); User Interface; Stream Agents; New Content Streams.]
Thinking Outside the CEP Box
[Diagram: the CEP framework model redrawn with a region labeled "Outside The CEP Box". Labels: Content Sources (News, Level 1, Level 2, Other); Content Streams; Stream Adapters; CEP Engine (Event Processors, Application Logic); User Interface; Stream Agents; New Content Streams.]
Thinking Outside the Box
The Classic 9 Dots Puzzle: Connect all 9 dots with 4 straight lines without ever taking the pencil off of the paper
To solve the problem, you have to “think outside the box”
Content Sources and Content Distribution
[Diagram labels: Content Sources (News, Level 1, Level 2, Other); Content Streams; Content Distribution Fabric; Stream Adapters; CEP Engine (Events, Application Logic); Stream Agents; Application Logic.]
Why News?
SEC will Allow Companies to use the Internet to Improve Investor-Management Communications
From CFO.com - August 16, 2007
According to SEC chairman Christopher Cox, the commission will allow companies to use the Internet to improve investor-management communications. As currently proposed by the commission, a company interested in offering this venue to shareholders would alert them via …
News Moves Markets … Elementized News Moves Markets …
The Metaverse
The Metadata Universe (or Metaverse) is the set of Categories (Entities and Subjects) that provide semantic understanding for text and data.
Categories:
• Geography: Regions, Countries, Physical Features
• Industry: Sector Hierarchy (Multiple Schemes)
• Event: Corp. Action, Meeting, et al
• Organization: Gov’t, Agency, Company, NGO, Market Participant
• Person (Multiple Roles)
• Quote, Trade, IOI, Advertisement, Order
• Market (Equity, Commod, FI, et al)
• Index: Financial Indexes
• Instrument: Security, Future, Derivative, et al
• Indicator: Economics, Market Stats
Relationships shown between categories: Is grouped by; Officer of; Subsidiary of; Analyst For; Issues; Listed (Market Participant); Has Quotes; Operates within; Mkt. Part. – Provides Quotes; Indicator For; Index For
Classification Standard
GEOGRAPHY            ISO 3166
INDUSTRY             SIC + NAICS + TSE + GICS
MARKETS              ISO 10962
CURRENCY             ISO 4217
CORPORATE ACTIONS    ISO 15022
RESEARCH             RIXML
Sample hierarchies:
• Geography
    Africa
    Americas
        Central America
        North America
            United States of America
                Alabama
                Arkansas
    Asia
    Europe
    Oceania
• Industry
• Markets
    Equities
    Debt
    Package Units
    Futures
    Currency
    other
Categorization Mark-up Example
Entity: Schering-Plough (SGP-US) – An Organization Entity of type: Company
Entity: Merck KGAA (MRK-US) – An Organization Entity of type: Company
Entity: Pharmaceuticals – An Industry Entity
NewsML Mark-up example
<?xml version="1.0" encoding="UTF-8"?>
<newsItem guid="urn:newsml:CBS MarketWatch:20030620:20040903-000693:2"
          schema="0.0" dir="ltr" version="1"
          xmlns="http://iptc.org/std/nar/2006-10-01/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:toc="http://data.schemas.tfn.thomson.com/Common/2007-08-01/">
  <catalogRef href="http://iptc.org/std-dev/NAR/1.0/specification/IPTC-TempCatalog-inc_4.xml"/>
  <catalogRef href="http://news.schemas.tfn.thomson.com/schemes/TF_NewsML-G2-catalog.xml"/>
  <rightsInfo>
    <copyrightNotice>(C) 1997-2004 MarketWatch.com, Inc. All rights reserved.</copyrightNotice>
  </rightsInfo>
  <itemMeta>
    <itemClass qcode="ccls:text"/>
    <provider qcode="org:TFN"/>
    <versionCreated>2001-12-17T09:30:47.0Z</versionCreated>
    <firstCreated>2001-12-17T09:30:47.0Z</firstCreated>
    <pubStatus qcode="stat:usable"/>
    <role qcode="rol:urgent"/>
    <service qcode="NewsServiceId:NSID1">
      <name>News Service 1</name>
    </service>
  </itemMeta>
  <contentMeta toc:careVersion="1" toc:careTrainingSet="2007-07-01"
               toc:dexterVersion="1" toc:dexterTrainingSet="2007-07-01"
               toc:stratifyVersion="1" toc:stratifyTrainingSet="2007-07-01">
    <urgency>3</urgency>
    <contentCreated>1967-08-13</contentCreated>
    <contentModified>1967-08-13</contentModified>
    <infoSource qcode="org:TFN"/>
    <headline>Staffing company shares mixed after jobs report</headline>
    <by>Ciara Linnane</by>
    <dateline>12:21 PM ET Sep 3, 2004</dateline>
    <language tag="en-us"/>
    <subject type="type:subject" qcode="CategoryId:1234567" creator="org:thomson"/>
    <subject type="type:subject" qcode="CategoryId:1234568" creator="sys:care" why="why:machine-generated" confidence="70" relevance="65"/>
    <subject type="type:organization" qcode="OrganizationId:0123456789"/>
  </contentMeta>
  <contentSet xmlns:tfc="http://news.schemas.tfn.thomson.com/Common/2007-07-06/"
              xsi:schemaLocation="http://news.schemas.tfn.thomson.com/Common/2007-07-06/ NewsCommonTypes.xsd">
    <inlineXML xml:lang="en-us" contenttype="application/xhtml+xml"
               xsi:schemaLocation="http://www.w3.org/1999/xhtml xhtml11-tfnews.xsd">
      <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
          <title>Staffing company shares mixed after jobs report</title>
        </head>
        <body>
          <p>NEW YORK (CBS.MW) -- After rallying <span guid="xxxx">for the past few sessions, </span> shares of staffing firms and payroll processors were mixed Friday as investors digested the August jobs report, showing <toc:Category xsi:type="toc:Indicator" IndicatorId="0234556">U.S. payrolls</toc:Category> rebounding after two sluggish months.</p>
          <p>The <toc:Category xsi:type="toc:Organization" OrganizationId="9000000056"> Labor Department</toc:Category> said the economy added 144,000 jobs, well above the 32,000 reading in July.</p>
          <p>The <toc:Category xsi:type="toc:Indicator" IndicatorId="0234551">unemployment rate</toc:Category> fell one-tenth of a percentage point to 5.4 percent, the lowest rate since October 2001, primarily because 152,000 adults dropped out of the labor force.</p>
          <p>Economists surveyed by CBS MarketWatch were expecting job growth of about 158,000, close to the 177,000 average for the first seven months of the year, and a jobless rate of 5.5 percent. <a href="http://cbs.marketwatch.com/news/economy/economic_calendar.asp?siteid=mktw">See Economic Calendar. </a> </p>
          ...
Document Level Mark-up (Categories only)
...
<subject type="type:subject" qcode="CategoryId:1234567" creator="org:thomson"/>
<subject type="type:subject" qcode="CategoryId:1234568" creator="sys:care" why="why:machine-generated" confidence="70" relevance="65"/>
...
In-line Markup (Categories + Facts)
<body>
...
<p>The <toc:Category xsi:type="toc:Indicator" IndicatorId="0234551">unemployment rate</toc:Category> fell one-tenth of a percentage point to 5.4 percent, the lowest rate since October 2001, primarily because 152,000 adults dropped out of the labor force.</p>
...
<p>"We were encouraged to see the headline payroll number meet expectations after two months of disappointments," said <toc:Category xsi:type="toc:Organization" OrganizationId="0234556">SunTrust Robinson Humphrey</toc:Category> analyst <toc:Category xsi:type="toc:Person" PersonId="122456">Tobey Sommer</toc:Category>. The report, he said, "is likely to improve investor sentiment on employment-related stocks."</p>
<p> <toc:Category xsi:type="toc:Quote" QuoteId="123456781">Manpower (MAN-US)</toc:Category> shares led the gainers, rising 2.5 percent to $44.52. <...
</body>
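As a rough illustration of how a downstream consumer might read the in-line mark-up shown above, the following sketch pulls each `toc:Category` element out of a story body. It assumes the `toc` and `xsi` namespace URIs from the sample; the trimmed `SNIPPET`, the function name `inline_categories`, and the convention that the identifier attribute is named `<Type>Id` are inferred from the example for illustration and are not a documented Thomson API.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the NewsML sample on the slide.
TOC_NS = "http://data.schemas.tfn.thomson.com/Common/2007-08-01/"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical, shortened story body reusing tags from the slide.
SNIPPET = f"""
<body xmlns:toc="{TOC_NS}" xmlns:xsi="{XSI_NS}">
  <p>The <toc:Category xsi:type="toc:Indicator" IndicatorId="0234551">unemployment rate</toc:Category> fell to 5.4 percent.</p>
  <p>said <toc:Category xsi:type="toc:Organization" OrganizationId="0234556">SunTrust Robinson Humphrey</toc:Category>
     analyst <toc:Category xsi:type="toc:Person" PersonId="122456">Tobey Sommer</toc:Category>.</p>
</body>
"""


def inline_categories(xml_text):
    """Return (category type, identifier, tagged text) for each in-line tag."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter(f"{{{TOC_NS}}}Category"):
        # xsi:type looks like "toc:Indicator"; keep the part after the prefix.
        kind = element.get(f"{{{XSI_NS}}}type", "").split(":")[-1]
        # Assumption: the id attribute is named after the type, e.g. IndicatorId.
        identifier = element.get(f"{kind}Id")
        found.append((kind, identifier, (element.text or "").strip()))
    return found


if __name__ == "__main__":
    for row in inline_categories(SNIPPET):
        print(row)
```

The tuples come back in document order, so the first reference to each entity can be picked out by scanning the list once.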
Auto-categorization Technology
• Much financial, legal and medical information exists in the form of textual documents
• Traditional “Editorial” processes to tag/index documents can now be augmented by algorithms that can achieve very high precision (~95%) against very large ontologies (tens of thousands of terms)
• Thomson employs a technology called CaRE (Categorization and Recommendations Engine) to do this, which originated in the Thomson Legal and Regulatory division.
• CaRE uses a set of statistics-based algorithms that are trained to understand a specific ontology as a concept scheme.
Elementized News – Summary
In-line News vs. Document level Mark-up
• Each News story is tagged at three levels.
    • Document Level: The overall story lists all the category metadata (Entities + Subjects + Genre + Sentiment) for the story.
    • In-line Entities: Each initial reference to an Entity is marked up “in-line” in the document for additional context.
    • In-line Facts: Specific Numeric Elements (e.g. US GDP or Thomson Q3 Revenue) are tagged using XML elements.
The Value of News Elements
• Sentiment tags (e.g. positive earnings or negative rating) and Subject tags provide semantic understanding of the news story
• Numeric Facts (when Elementized) are directly process-able by algorithms.
• Entity tags (e.g. Company references) allow news to be linked and correlated to Market data streams by CEP engines, for example, to make trading decisions
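To make the last point concrete, here is a minimal sketch of how entity tags could let a stream processor join news to market data. Everything in it is hypothetical: the `Quote` and `NewsItem` record shapes, the `NewsQuoteCorrelator` class, the `"positive"` sentiment label, and the 60-second window are assumptions chosen for illustration, not a description of any particular CEP engine.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    quote_id: str   # same identifier space as the QuoteId attribute in the mark-up
    price: float
    ts: float       # event time in seconds


@dataclass
class NewsItem:
    headline: str
    quote_ids: list  # QuoteIds taken from the story's entity tags
    sentiment: str   # hypothetical document-level sentiment label
    ts: float


class NewsQuoteCorrelator:
    """Keeps the latest quote per instrument and joins incoming news to it."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.last_quote = {}

    def on_quote(self, quote):
        self.last_quote[quote.quote_id] = quote

    def on_news(self, news):
        """Return (quote_id, last price, sentiment) for each fresh match."""
        matches = []
        for quote_id in news.quote_ids:
            quote = self.last_quote.get(quote_id)
            # Only correlate when the quote is recent relative to the story.
            if quote is not None and 0 <= news.ts - quote.ts <= self.window:
                matches.append((quote_id, quote.price, news.sentiment))
        return matches


if __name__ == "__main__":
    correlator = NewsQuoteCorrelator()
    correlator.on_quote(Quote("123456781", 44.52, ts=100.0))
    story = NewsItem("Manpower shares led the gainers", ["123456781"], "positive", ts=130.0)
    print(correlator.on_news(story))
```

The join key is the entity identifier rather than the company name, which is what the in-line tagging makes possible: no text matching is needed at correlation time.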
Content Sources and Content Distribution
[Diagram labels: Content Sources (News, Level 1, Level 2, Other); Content Streams; Content Distribution Fabric; Stream Adapters; CEP Engine (Events, Application Logic); Stream Agents; Application Logic.]
The Content Distribution Fabric
[Diagram: three interaction patterns over the Content Aware Network]
• Initialization: Service Provider, Service Consumer
• Synchronization: Service Provider, Service Consumer
• Intermediation: Service Provider, Service Consumer, Service Broker, Service Contract (Register, Find, Bind)
“X” Marks the spot
Content-Aware Hardware Infrastructure
[Diagram labels: Mobile Devices; Applications; Databases; Content-Aware Network; IP/MPLS Network.]
Modules: Routing Module, Transformation Module, Advanced Interface Module, Assured Delivery Module
• 500,000 routes
• 1000’s of transformations / sec
• >1MM msgs / sec
• Active/active fail-over
• 0.7 ms transit for a 4K XML document
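The idea of routing on message content rather than on a destination address can be sketched in a few lines of software. This is only an illustration of the concept: the `ContentRouter` class, its `subscribe` and `route` methods, and the namespace-free sample document are all hypothetical, and the sketch says nothing about how the hardware modules above are implemented or configured.

```python
import xml.etree.ElementTree as ET


class ContentRouter:
    """Delivers an XML document to every subscriber whose rule matches it."""

    def __init__(self):
        self.routes = []  # list of (subscriber, element path, attribute, value)

    def subscribe(self, subscriber, path, attribute, value):
        self.routes.append((subscriber, path, attribute, value))

    def route(self, xml_text):
        """Return subscribers, in registration order, whose rule matches."""
        root = ET.fromstring(xml_text)
        targets = []
        for subscriber, path, attribute, value in self.routes:
            matched = any(el.get(attribute) == value for el in root.findall(path))
            if matched and subscriber not in targets:
                targets.append(subscriber)
        return targets


# Hypothetical message; namespaces are omitted to keep the paths short.
DOC = """
<newsItem>
  <contentMeta>
    <subject type="type:subject" qcode="CategoryId:1234567"/>
    <subject type="type:organization" qcode="OrganizationId:0123456789"/>
  </contentMeta>
</newsItem>
"""

if __name__ == "__main__":
    router = ContentRouter()
    router.subscribe("desk-a", ".//subject", "qcode", "OrganizationId:0123456789")
    router.subscribe("desk-b", ".//subject", "qcode", "CategoryId:0000000")
    print(router.route(DOC))
```

A linear scan over rules like this would not scale to the route counts quoted on the slide; it is meant only to show what a single content rule looks like.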
New Streaming Content Sources
[Diagram labels: Content Sources (News, Level 1, Level 2, Financials, Deals (M&A), Research, Briefings, Filings, Estimates, …); Content Streams; Content Distribution Fabric (Intermediation, Initialization, Synchronization); Complex Event Processing Applications (Stream Adapters, CEP Engine with Events and Application Logic, Stream Agents).]
The Entity Model vs the Relational Model
[Diagram labels: Physical: Relational Data (Tables); Transform; Logical: Canonical Business Entities (Entities).]
Changed Data Capture as an Enabling technology
[Diagram labels: Content Source (Tables, Transaction Log, Table Triggers or Log Mining); Changed Data Capture (publish); Transform; Content Distribution Fabric.]
1: Publishing pipeline – For databases built using a “publishing pipeline pattern”, events can be generated directly.
2: Database Triggers – Database triggers can be used to generate events, but this is not recommended.
3: Log Mining – A technique that watches the transaction log that modern databases use to capture all changes as they are made.
4: Transformation – The final step is transforming the transactional changes made to the databases into XML messages that capture the “business event” in a form that can be processed downstream.
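The transformation step (4) can be sketched as a function that turns one captured row change into an XML message. The shape of the `change` record (keys `table`, `op`, `key`, `before`, `after`) and the `businessEvent` / `field` / `old` / `new` element names are invented for this example; a real changed-data-capture tool and the canonical data model would each define their own formats.

```python
import xml.etree.ElementTree as ET


def change_to_event(change):
    """Convert one captured row change into an XML business-event string.

    `change` is a hypothetical record such as a log miner might emit:
    {"table": ..., "op": ..., "key": ..., "before": {...}, "after": {...}}
    """
    event = ET.Element(
        "businessEvent",
        {"entity": change["table"], "operation": change["op"], "key": str(change["key"])},
    )
    before = change.get("before") or {}
    after = change.get("after") or {}
    # Emit only the columns whose values actually changed.
    for column in sorted(set(before) | set(after)):
        old, new = before.get(column), after.get(column)
        if old != new:
            field = ET.SubElement(event, "field", {"name": column})
            if old is not None:
                ET.SubElement(field, "old").text = str(old)
            if new is not None:
                ET.SubElement(field, "new").text = str(new)
    return ET.tostring(event, encoding="unicode")


if __name__ == "__main__":
    sample = {
        "table": "Estimate",
        "op": "update",
        "key": 42,
        "before": {"eps": 1.10, "analyst": "A"},
        "after": {"eps": 1.25, "analyst": "A"},
    }
    print(change_to_event(sample))
```

Emitting only the changed columns keeps the message close to the "business event" (an estimate was revised) rather than a dump of the whole row.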
Content Distribution Pattern
[Diagram labels: Content Source(s); Ingest Interface (Feeds + Authoring); The Content Master; The Enterprise Database; Metadata; Data Interface (Content Distribution); Canonical Data Model (in XML); Content Distribution Fabric (Intermediation, Initialization, Synchronization); The Application Database; Service Interface; Human Interface.]
TF Information Architecture v0.7
[Diagram labels: Content Source(s); Ingest Interface (Ripping); Content Master Database; Metadata; Data Interface (Content Distribution); Application Database; Service Interface; Human Interface.]
And if you Squint just a little tiny bit …
The World of Event-Oriented Content
[Diagram labels: Content Sources (News, Level 1, Level 2, Financials, Deals (M&A), Orders, IOIs, Research, Briefings, Filings, Estimates, And More); Content Streams; Content Distribution Fabric (Intermediation, Initialization, Synchronization); Complex Event Processing Applications (Stream Adapters, CEP Engine with Events and Application Logic, Stream Agents).]
In this new world, all content has the potential to change “transactionally”. We have lots of interesting new content streams for CEP-aware applications and a content distribution fabric that itself has event stream processing capabilities.
The End
Appendix
Thomson Master Categories: Sample Structure
Canonical:
• Geography: 3353 Categories
• Industry: 2482 Categories
• Market: 354 Categories
• Instrument: Security, Future, Derivative, et al
• Indicator: Economics, Market Stats
• Event: Corp Action, et al
Presentation (sample):
• Economics & Trade
    • Surveys & Cyclical Indexes
    • National Accounts
    • Money & Finance
• Lower-level terms shown: Consumer Surveys; Business Surveys; Cyclical & Activity Indexes; GDP by Expenditure; GDP by Industry; Incomes; Investment Capital; Exports; Imports; Money Supply; Activity Index; Leading Indicator
Canonical terms are mapped to presentation terms at the most precise level of the hierarchy. The presentation hierarchy contributes to search relevance.
Intelligent Network Hardware Performance
Measure                                               Software Infrastructure      Hardware Infrastructure
Messaging Throughput (msgs/sec)                       Tens of thousands            > Million
Messaging Latency (at 50% of peak load)               Milliseconds                 Microseconds
Content Routing (number of rules)                     Small number of thousands    Hundreds of thousands
Content Routing Latency (with 1000+ content rules)    Seconds                      Microseconds
Persistent Messaging (msgs/second)                    A few thousand               Many tens of thousands
Transformations (sustained throughput)                MB/sec                       GB/sec