Tech view on Regulatory Compliance
MarkLogic User Group Benelux Meetup, December 2016
Speaker: Alexander L. de Goeij
About me
• Architect / Consultant
• Financial Services: Core Trading
• Regulations: EMIR, MiFID II
• Architecture: Enterprise / Solution / Project Architect
• Consulting: IT Strategy, implementations, vendor selection, etc.
• Business degree, Tech addiction.
“Regulations really make my life more fun!” As said by no-one, ever.
“Regulations really make my life more exciting!” As said by everyone who gets to use cool databases!
The challenge we think we are facing:
[Diagram: Source Data → Extract / Transform / Load → Some Application (extract, load) → Send → Happy Regulator]
The actual challenge we are facing:
[Diagram: Source Data is extracted via Email, FTP, REST and SOAP and loaded into DB 1, Tool 2, Thing N and a "database you didn’t know still existed", each with its own load and extract steps, before anything reaches the Happy Regulators]
Current solution:
Doesn’t work anymore:
• Auditability / Process checks included in Regulations.
• Obligation to re-report.
• More complex Ad-Hoc requests from the Regulator.
• Not suited for Real-Time reporting.
• Waste of money…
What do we need?
• Auditability: keep original data in original format to prove results, keep track of ‘who-did-what’ with the data.
• Consistency: real-time requirement from regulator demands more than eventual consistency.
• Forward Flexibility: we know we don’t know what we will have to report tomorrow.
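The auditability requirement above can be made concrete with a small sketch. It assumes an append-only store that keeps every source message byte for byte, fingerprints it with SHA-256, and logs who did what with it. `AuditStore`, `ingest`, `record_action` and the sample trade are hypothetical names for illustration, not part of any product mentioned in this talk.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One 'who-did-what' entry; frozen so it cannot be altered afterwards."""
    doc_id: str
    actor: str
    action: str
    at: datetime


@dataclass
class AuditStore:
    """Hypothetical append-only store: originals are kept untouched."""
    _originals: dict = field(default_factory=dict)
    _events: list = field(default_factory=list)

    def ingest(self, raw: bytes, actor: str) -> str:
        # The SHA-256 digest doubles as document id and integrity proof.
        doc_id = hashlib.sha256(raw).hexdigest()
        self._originals.setdefault(doc_id, raw)
        self.record_action(doc_id, actor, "ingest")
        return doc_id

    def record_action(self, doc_id: str, actor: str, action: str) -> None:
        if doc_id not in self._originals:
            raise KeyError(f"unknown document {doc_id}")
        self._events.append(
            AuditEvent(doc_id, actor, action, datetime.now(timezone.utc))
        )

    def original(self, doc_id: str) -> bytes:
        raw = self._originals[doc_id]
        # Re-check the fingerprint to prove the stored copy is unchanged.
        if hashlib.sha256(raw).hexdigest() != doc_id:
            raise ValueError("stored original no longer matches its digest")
        return raw

    def history(self, doc_id: str) -> list:
        return [e for e in self._events if e.doc_id == doc_id]


store = AuditStore()
trade_xml = b"<trade id='T1'><notional>1000000</notional></trade>"
doc_id = store.ingest(trade_xml, actor="feed-loader")
store.record_action(doc_id, actor="alice", action="included-in-report")
```

Because the original bytes are never transformed, a report can always be traced back to exactly what the source system sent, and `history(doc_id)` answers the 'who-did-what' question.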
Looking to technology for a better answer!
Your favorite RDBMS
• ACID, consistent, and blazing fast if you buy Exadata
• Normalize your way out, and fail.
• Not fit for processing/reporting across different data objects: e.g. Trades and Mortgages
• Try to do NoSQL with SQL (innovative, but terribly slow and impossible to maintain)
Example of what not to do:
[Figure: SQL examples]
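A minimal sketch of one common way to "do NoSQL with SQL": an entity-attribute-value (EAV) table, run here against an in-memory SQLite database. This is an assumption about what such an anti-pattern typically looks like, not the example from the slide; the table and attribute names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eav (entity_id TEXT, attribute TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [
        ("T1", "type", "trade"),
        ("T1", "counterparty", "ACME"),
        ("T1", "notional", "1000000"),
        ("M1", "type", "mortgage"),
        ("M1", "counterparty", "ACME"),
        ("M1", "principal", "250000"),
    ],
)

# Every attribute needed in the result costs another self-join, and all
# values are untyped text, so the database cannot validate or index them well.
rows = conn.execute(
    """
    SELECT t.entity_id, cp.value AS counterparty, n.value AS notional
    FROM eav AS t
    JOIN eav AS cp ON cp.entity_id = t.entity_id AND cp.attribute = 'counterparty'
    JOIN eav AS n  ON n.entity_id  = t.entity_id AND n.attribute  = 'notional'
    WHERE t.attribute = 'type' AND t.value = 'trade'
    """
).fetchall()
print(rows)
```

Two attributes already need two self-joins; a realistic regulatory report with dozens of fields quickly becomes slow and hard to maintain, which is the point of the bullet above.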
MongoDB
• Free! Open Source! GridFS!
• Have to transform data on ingest (to JSON) as most data is XML
• Eventual consistency (AKA data loss) means not real-time.
• Good at homogeneous data.
• Still master-slave, with scaling issues
• Brilliant for RAD / prototyping!
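To illustrate the "transform on ingest" point, here is a naive XML-to-JSON conversion using only the Python standard library. It is a sketch under assumptions: the `@` and `#text` conventions and the sample trade document are illustrative choices, and the conversion is deliberately lossy (mixed content, namespaces and element order are not preserved).

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Naive, lossy XML-to-dict conversion (attributes prefixed with '@')."""
    node = {f"@{k}": v for k, v in elem.attrib.items()}
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        # A leaf without attributes collapses to its text value.
        return text if not node else {**node, "#text": text}
    for child in children:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags become a list; a single occurrence stays scalar,
            # so the JSON shape depends on the data that happens to arrive.
            existing = node[child.tag]
            if not isinstance(existing, list):
                node[child.tag] = [existing]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


xml_doc = "<trade id='T1'><leg>pay</leg><leg>receive</leg><ccy>EUR</ccy></trade>"
root = ET.fromstring(xml_doc)
as_json = json.dumps({root.tag: element_to_dict(root)}, sort_keys=True)
print(as_json)
```

Once the data has been reshaped like this, the stored document is no longer the original, which conflicts with the requirement to keep source data in its original format.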
Where things go wrong:
Source: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
Cassandra (DataStax)
• Favors data duplication over normalization
• Very fast (if you duplicate well) but does not do JOINs
• Used by ING as main component of their Risk grid (YouTube)
• Excellent for time series data
Source: https://academy.datastax.com/resources/getting-started-time-series-data-modeling
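The time series pattern can be sketched without a Cassandra cluster. The toy class below mimics the modeling idea only: rows are grouped into one partition per (series, day) and kept sorted by timestamp inside it, so reading a day is a single partition lookup with no join. It does not use Cassandra or its driver, and `TimeSeriesTable` and the sample quotes are hypothetical.

```python
from bisect import insort
from collections import defaultdict
from datetime import date, datetime


class TimeSeriesTable:
    """Toy stand-in for a table partitioned by (series, day)."""

    def __init__(self):
        # partition key -> list of (timestamp, value), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, series: str, ts: datetime, value: float) -> None:
        insort(self._partitions[(series, ts.date())], (ts, value))

    def read_day(self, series: str, day: date) -> list:
        # One partition lookup: no join and no scan of other days.
        return list(self._partitions.get((series, day), []))


table = TimeSeriesTable()
table.insert("EURUSD", datetime(2016, 12, 1, 9, 0, 5), 1.0652)
table.insert("EURUSD", datetime(2016, 12, 1, 9, 0, 1), 1.0650)
table.insert("EURUSD", datetime(2016, 12, 2, 9, 0, 0), 1.0661)

day_one = table.read_day("EURUSD", date(2016, 12, 1))
```

The trade-off matches the bullets above: answering a different question (for example, all series at one instant) would need a second, duplicated table keyed for that query.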
Hadoop
Source: http://hortonworks.com/products/data-center/hdp/
MarkLogic
• Focused on heterogeneously structured data
• Bitemporal, if you dare
• Semantics / RDF Triples
• ACID, Consistent, stores original file
• ABAC & redaction in enterprise version
• Rules, Workflows, Alerts, Triggers
• Not a COTS!
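"Bitemporal" means tracking two time axes: when a fact was true in the real world (valid time) and when the system recorded it (system time). The sketch below shows the idea in plain Python with simplified semantics (no end of validity). It is not MarkLogic's temporal API; `BitemporalSeries`, `record` and `as_of` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Version:
    value: float
    valid_from: date    # when the fact became true in the real world
    recorded_on: date   # when the system learned about it


class BitemporalSeries:
    """Append-only versions of a single fact, e.g. a trade's notional."""

    def __init__(self):
        self._versions = []

    def record(self, value: float, valid_from: date, recorded_on: date) -> None:
        self._versions.append(Version(value, valid_from, recorded_on))

    def as_of(self, valid_at: date, known_at: date) -> Optional[float]:
        """What did we believe on `known_at` about the state on `valid_at`?"""
        candidates = [
            v for v in self._versions
            if v.valid_from <= valid_at and v.recorded_on <= known_at
        ]
        if not candidates:
            return None
        # Latest valid_from wins; among equals, the latest recording wins.
        best = max(candidates, key=lambda v: (v.valid_from, v.recorded_on))
        return best.value


notional = BitemporalSeries()
notional.record(1_000_000, valid_from=date(2016, 11, 1), recorded_on=date(2016, 11, 1))
# A correction arrives later but applies retroactively from 1 November.
notional.record(1_200_000, valid_from=date(2016, 11, 1), recorded_on=date(2016, 11, 20))

reported = notional.as_of(date(2016, 11, 10), known_at=date(2016, 11, 15))
corrected = notional.as_of(date(2016, 11, 10), known_at=date(2016, 11, 25))
```

Here `reported` reproduces what was sent to the regulator before the correction, while `corrected` gives the value to use when re-reporting. Nothing is overwritten, so both answers remain provable.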
Ok, so now what?
Two approaches to a solution
Infra approach:
• Build everything yourself, use open source components
E.g.:
• Hadoop
• Cassandra + Kafka
Platform approach:
• Focus on application and business logic, not on infra
E.g.:
• MarkLogic
• Spark (without Hadoop)
Infra approach (SMACK example)
• Used (and designed) by Netflix, LinkedIn, Uber, Twitter
• Massive amounts of event processing (IoT)
• HA and Geo distributed
• Scala, Python, R, Java(Script)
• Asynchronous everywhere
• Near impossible to destroy: reactive, self-healing, back-pressure.
[Diagram: SMACK stack. Kafka, Akka Actors, Play REST APIs, Spark and Cassandra run on Mesos OS, alongside Marathon and Zookeeper, on a cluster of bare-metal machines]
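Back-pressure, one of the properties listed for the SMACK stack, can be shown in miniature. The sketch below is a single-process analogy using Python's `asyncio`, not Akka or Kafka: a bounded queue makes a fast producer wait for a slow consumer. The function names, queue capacity and event count are arbitrary choices for illustration.

```python
import asyncio


async def producer(queue: asyncio.Queue, n_events: int) -> None:
    for i in range(n_events):
        # put() suspends while the queue is full, so a fast producer is
        # slowed down to the consumer's pace instead of exhausting memory.
        await queue.put(i)
    await queue.put(None)  # sentinel: no more events


async def consumer(queue: asyncio.Queue, processed: list, max_seen: list) -> None:
    while True:
        max_seen[0] = max(max_seen[0], queue.qsize())
        event = await queue.get()
        if event is None:
            break
        await asyncio.sleep(0)  # stand-in for slow downstream work
        processed.append(event)


async def main(n_events: int = 100, capacity: int = 5):
    queue = asyncio.Queue(maxsize=capacity)
    processed, max_seen = [], [0]
    await asyncio.gather(
        producer(queue, n_events),
        consumer(queue, processed, max_seen),
    )
    return processed, max_seen[0]


processed, peak_queue_size = asyncio.run(main())
```

All events arrive in order and the number of buffered events never exceeds the configured capacity, regardless of how many events the producer wants to send.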
Platform approach
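[Diagram: columns Data Flows, Data Stores, Analytics and Feedback Loop. Source Data (qualitative and quantitative) flows into MarkLogic and a time series database ("Insert Time Series Database here"), with Spark for analytics, on the way to the Happy Regulator]
• Schema transformations
• Business Rules
• Workflow
• Rights management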
Main take-aways
• There are no one-stop solutions
• Don’t pick bleeding edge stuff if you need it to work
• Focus on Business benefit of investment in Regulatory Compliance
• Separate the platform from the project!
• Start small, think big
Thank you for listening!
Alexander L. de Goeij
References
• https://academy.datastax.com/resources/getting-started-time-series-data-modeling
• http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
• http://hortonworks.com/products/data-center/hdp/
• https://www.linkedin.com/pulse/data-hubs-marklogic-vs-hadoop-kurt-cagle
• https://engineering.linkedin.com/blog/2016/04/kafka-ecosystem-at-linkedin
• http://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop
• https://blog.twitter.com/2015/handling-five-billion-sessions-a-day-in-real-time
• http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflix.html