Automatic financial data integration with FIBO
DESCRIPTION
Data integration with FIBO and Legal Entity Identifiers. Problem: businesses produce and exchange documents and messages in a wide variety of formats, with different descriptors for entities and their properties: Text, PDF, DOC, XML, JSON, CSV, XLS, FIBO, SWIFT, ISO 20022, FIX, etc. Archival documents are stored in “data lakes”, but 80% of the effort in any “data mining” program is in data cleaning. Swap data is published in similar but not identical formats… Cash-flow modeling and prediction. Name and concept resolution as a service.
TRANSCRIPT
Data integration with FIBO
and Legal Entity Identifiers
A whole product proposal
Ontology2
Paul Houle, +1 (607) 539 6254
KM Solutions
William Freeman, +1 (774) 301 1301
Problem: businesses produce and exchange documents and messages in a wide variety of formats, with different descriptors for entities and their properties
Text, PDF, DOC, XML, JSON, CSV, XLS, FIBO, SWIFT, ISO 20022, FIX, etc.
Keeping track of this in interactive systems is hard
Archival documents stored in “data lakes”
But 80% of the effort in any “data mining”
program is in data cleaning
Clean knowledge and data packaged for low-latency queries and quality-controlled decisions
Solution: toolsets and methods now exist to drastically accelerate this process
Scalable-first Architecture
Business-friendly inference based on First-Order Logic
:BaseKB
Numerous sources of lexical, taxonomical, ontological and
operational information
Natural language documentation
Statistical quality control of decision making systems; tests of
specific requirements
Reproducible packaging of software and data for almost any cloud or virtual environment
SUGGESTED UPPER MERGED ONTOLOGY
IEEE
RDF SQL CSV
PRODUCTION RULES
FIRST-ORDER LOGIC
ISO COMMON LOGIC
OWL
MODAL AND HIGHER-ORDER LOGIC
TAXONOMIES, DICTIONARIES
Define and relate terms and entities
CONCEPTUAL ONTOLOGIES AND THESAURI
Properties of and specialized relationships between entities
OPERATIONAL ONTOLOGIES AND THEORIES
Targeted quality-controlled inference for specific domains
AUTOMATIC GENERATION OF USER INTERFACES
Browsing interfaces, Mixed Initiative and Asynchronous interaction
Canonical data model
Cash-flow modeling and prediction
Swap Data Repositories
Swap data is published in similar but not identical formats…
Different field names: “DISSEMINATION ID” vs “Dissemination Id”; “EXECUTION TIMESTAMP” vs “EXEC TIMESTAMP”
Vocabularies used can be different: e.g. “TRUE” or “FALSE” vs “Y” and “N” or “0” and “1”
Columns are grouped: “PRICE NOTATION TYPE” and “PRICE NOTATION” or “{FIELD}_{NUMBER}_{SUBFIELD}”
Column Profiling And Analysis
Empirical Rules
“Squash case in identifiers”
“Underscores can be equivalent to spaces”
“EXECUTION” can be shortened to “EXEC”
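The empirical rules above could be applied as a small normalization step before field names are compared. The following is a minimal sketch, not the authors' implementation; the function name normalize_field_name and the abbreviation table are hypothetical, and only the EXEC/EXECUTION pair comes from the slide.

```python
import re

# Hypothetical abbreviation table; only EXEC -> EXECUTION comes from the slide.
ABBREVIATIONS = {"exec": "execution"}


def normalize_field_name(name: str) -> str:
    """Map a raw column header to a canonical comparison key."""
    # Rule: squash case in identifiers.
    key = name.strip().lower()
    # Rule: underscores can be equivalent to spaces.
    key = key.replace("_", " ")
    # Split on runs of whitespace so "EXEC  TIMESTAMP" still matches.
    words = re.split(r"\s+", key)
    # Rule: expand known abbreviations word by word.
    words = [ABBREVIATIONS.get(word, word) for word in words]
    return " ".join(words)
```

Under these assumptions, “DISSEMINATION ID” and “Dissemination Id” produce the same key, as do “EXECUTION TIMESTAMP”, “EXEC TIMESTAMP” and “EXECUTION_TIMESTAMP”.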
Resolution against dictionaries
FIBO / ISO 20022 Controlled Terms
Wikipedia / Freebase / :BaseKB
WordNet
Grammatical Patterns:
PRICE_NOTATION_1
PRICE_NOTATION_2
PRICE_NOTATION_CURRENCY_1
PRICE_NOTATION_CURRENCY_2
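One way to exploit such grammatical patterns is to group columns that share a trailing numeric index, so that each group can be treated as one repeated record. This is a sketch under the assumption that the index is always the last underscore-separated token, as in the four names above; the regular expression and the function group_indexed_columns are illustrative, not part of the original toolset.

```python
import re
from collections import defaultdict

# Assumed pattern: a stem followed by a trailing numeric index,
# e.g. PRICE_NOTATION_CURRENCY_2 -> stem "PRICE_NOTATION_CURRENCY", index 2.
INDEXED = re.compile(r"^(?P<stem>.+?)_(?P<index>\d+)$")


def group_indexed_columns(columns):
    """Group column names by their trailing numeric index."""
    groups = defaultdict(dict)
    for column in columns:
        match = INDEXED.match(column)
        if match:
            # Map the shared stem back to the concrete column name.
            groups[int(match.group("index"))][match.group("stem")] = column
    return dict(groups)
```

Columns without a trailing index are left out of the result, so they can be handled by other rules.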
Defined Data Formats
Resolution against XSD, OWL, UML or print documentation
Column Statistics, Hypothesis Generation and Testing
Identify plausible interpretations of fields as numeric, date, time, name, address, Boolean, keys and controlled vocabulary terms
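Hypothesis generation from column statistics might look like the sketch below: each candidate type is kept if enough of the sampled values are consistent with it. The Boolean vocabularies are the variants quoted earlier in the deck; the date formats, the 95% threshold and all function names are assumptions chosen for illustration.

```python
from datetime import datetime

# Boolean vocabularies quoted in the deck: TRUE/FALSE, Y/N, 0/1.
BOOLEAN_VOCABULARIES = [{"TRUE", "FALSE"}, {"Y", "N"}, {"0", "1"}]
# Assumed date formats; real feeds may use others.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_date(text):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def type_hypotheses(values, threshold=0.95):
    """Return the set of plausible types for a column of strings."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return set()
    hypotheses = set()
    distinct = {v.upper() for v in cleaned}
    # Boolean: every distinct value belongs to one known vocabulary.
    if any(distinct <= vocabulary for vocabulary in BOOLEAN_VOCABULARIES):
        hypotheses.add("boolean")
    # Numeric and date: at least `threshold` of the values must parse.
    if sum(map(_is_number, cleaned)) / len(cleaned) >= threshold:
        hypotheses.add("numeric")
    if sum(map(_is_date, cleaned)) / len(cleaned) >= threshold:
        hypotheses.add("date")
    return hypotheses
```

A column may keep several hypotheses at once (a column of “0” and “1” is both Boolean and numeric here); later constraint solving would be expected to choose between them.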
Type Constraints
Field has “end date” in name + field contains formatted dates = probable “end date” field
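The type constraint above combines two independent pieces of evidence, one from the field name and one from the values. A minimal sketch of that conjunction follows; the single ISO-style date format and the function names are assumptions, and a production rules engine would express this as a rule rather than as a Python function.

```python
from datetime import datetime


def looks_like_date(value, fmt="%Y-%m-%d"):
    """True if the value parses with the assumed date format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def probable_end_date(field_name, values):
    """Combine name evidence and value evidence for an 'end date' field."""
    # Evidence 1: the field name contains "end date" (underscores as spaces).
    name_evidence = "end date" in field_name.lower().replace("_", " ")
    # Evidence 2: every sampled value is a formatted date.
    value_evidence = bool(values) and all(looks_like_date(v) for v in values)
    return name_evidence and value_evidence
```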
Ontology Fragment Matching
Inferred relationships tested for structural match against known ontologies
Test data
Hypothesis Generation
Constraint Solving
Production Rules ++ Engine
Reference Ontologies
Data Conversion
Rulebox
Production Rules Engine
Canonical Data Model
Clean Data
Theory Library
Data Conversion
Rulebox
Machine Readable Description of API
ACTUS Model
Selection and Parameter Determination
Possible Histories
Predictions about cash flow over time
Deployed to Library
Name and concept resolution as a service
“The Walt Disney Company”
“ウォルト・ディズニー・カンパニー”
“شرکت والت دیزنی”
“华特迪士尼公司”
“Los Angeles, CA”
“Los Ángeles”
“Los-Anĝeleso”
“Лос Анжелес”
“The Parent Trap”
“À nous quatre”
“Nie wierzcie bliźniaczkom”
“Лос Анжелес”
“天生一对”
Core database is independent of language
Context-sensitive name resolver drives user interface
and text analysis
m.09c7w0
english
spanish
short
long
“USA”
“United States of America”
“Estados Unidos”
JFK
“We landed at _”
Name
Context
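The idea of a language-independent core with labels attached per language and per form can be sketched as a lookup table keyed by identifier. The identifier m.09c7w0 and the three label strings are from the slide; the table layout, the function names, and the placement of “Estados Unidos” in the (spanish, long) slot are assumptions. The sketch leaves out the context-sensitive part (for example choosing a meaning for “JFK” in “We landed at _”).

```python
# Language-independent identifier -> labels keyed by (language, form).
# Assumption: "Estados Unidos" is filed under ("spanish", "long").
LABELS = {
    "m.09c7w0": {
        ("english", "short"): "USA",
        ("english", "long"): "United States of America",
        ("spanish", "long"): "Estados Unidos",
    },
}


def render_name(entity_id, language, form="long"):
    """Pick a display label for an entity in the requested language."""
    labels = LABELS[entity_id]
    # Prefer the exact (language, form) pair.
    if (language, form) in labels:
        return labels[(language, form)]
    # Otherwise fall back to any form available in that language.
    for (label_language, _), text in labels.items():
        if label_language == language:
            return text
    return None


def resolve_name(text):
    """Return every entity id with a label matching the text, ignoring case."""
    wanted = text.casefold()
    return [
        entity_id
        for entity_id, labels in LABELS.items()
        if any(label.casefold() == wanted for label in labels.values())
    ]
```

Because both directions go through the identifier, the same table can drive a user interface (identifier to label) and text analysis (label to identifier).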
Tabular & XML data
Names
Spatial Relationships
Company-place associations
Company-company associations
Post Code  City           Region            Country
94116      SAN FRANCISCO  CALAFORNIA        US
19866      WILMINGTON     DE                US
1044       BRUSSELS       BRUSSELS-CAPITOL  BE
6027       INNSBRUCK      AT-07             AT
Resolution…
• Of registration authorities that issue numbers (e.g. “Companies House”)
• Of companies to EDGAR CIK identifiers
• Of companies against your customer list
• Of companies against :BaseKB
Decomposition into independently scalable/tunable microservices
Test data
Judgments Invariants
Requirements
System
Output data
Comparison
Humans check output for quality
Humans change fact base and rule base
situations
decisions
Background knowledge
Machine learning
Team can fix problems at the root of errors in a decision-processing pipeline, with version control and a paper trail, or choose to bypass them
98.5% of addresses matched to region in global business database
resolved / not resolved
Systematic improvement by solving increasingly less common problems
Statistical Quality Control for Subjective Decisions
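When a match rate such as the 98.5% above is estimated from a sample of human judgments, it helps to report an interval rather than a single number. The sketch below uses the Wilson score interval for a binomial proportion, which is a standard choice and not necessarily the method used by the authors; the sample size in the usage note is hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denominator = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denominator
    half_width = (
        z
        * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
        / denominator
    )
    return centre - half_width, centre + half_width
```

For example, wilson_interval(985, 1000) gives the interval for a hypothetical sample of 1000 judged addresses of which 985 were resolved; a wider interval signals that more judgments are needed before a “good enough” threshold can be claimed.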
Proven Methodology of Automated Unit and Integration Testing
Test for wanted and unwanted behaviors
“Good Enough” or improved metrics
System
Customer
Testing accelerated by parallel and reproducible methods in Hybrid Cloud
High Throughput / Scalable / Parallel
Efficient Data Structures and serialization
driven by meta-model
Simple Binary Encoding
Specialized indexes for low-latency and expressive queries to support
user interfaces and decision-making
C L E O
The Need For Speed