tech view on regulatory compliance

Tech view on Regulatory Compliance

MarkLogic User Group Benelux Meetup, December 2016

Speaker: Alexander L. de Goeij

About me

• Architect / Consultant

• Financial Services: Core Trading

• Regulations: EMIR, MiFID II

• Architecture: Enterprise / Solution / Project Architect

• Consulting: IT Strategy, implementations, vendor selection, etc.

• Business degree, Tech addiction.

“Regulations really make my life more fun!” As said by no-one, ever.

“Regulations really make my life more exciting!” As said by everyone who gets to use cool databases!

The challenge we think we are facing:

[Diagram labels: Source Data, Extract, Transform, Load, Some Application, extract, load, Send, Happy Regulator]

The actual challenge we are facing:

[Diagram labels: Source Data; DB 1, Tool 2, Thing N (with Load and Extract steps); Database you didn’t know still existed; Email, FTP, REST, SOAP; Happy Regulators]

Current solution:

Doesn’t work anymore:

• Auditability / Process checks included in Regulations.

• Obligation to re-report.

• More complex Ad-Hoc requests from the Regulator.

• Not suited for Real-Time reporting.

• Waste of money…

What do we need?

• Auditability: keep original data in original format to prove results, keep track of ‘who-did-what’ with the data.

• Consistency: real-time requirement from regulator demands more than eventual consistency.

• Forward Flexibility: we know we don’t know what we will have to report tomorrow.
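The auditability requirement can be sketched in a few lines. This is a minimal illustration, not any product's API: the class name `AuditedStore`, its methods, and the sample trade XML are all hypothetical. It keeps the original bytes untouched, uses their SHA-256 hash as the identifier, and appends a 'who-did-what' entry for every access.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedStore:
    """Hypothetical store: original payloads plus an append-only audit trail."""
    originals: dict = field(default_factory=dict)   # doc_id -> original bytes
    audit_log: list = field(default_factory=list)   # append-only 'who-did-what'

    def _record(self, user: str, action: str, doc_id: str) -> None:
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "doc_id": doc_id,
        })

    def ingest(self, user: str, payload: bytes) -> str:
        # The SHA-256 of the raw bytes doubles as a tamper-evident identifier.
        doc_id = hashlib.sha256(payload).hexdigest()
        self.originals.setdefault(doc_id, payload)
        self._record(user, "ingest", doc_id)
        return doc_id

    def read(self, user: str, doc_id: str) -> bytes:
        self._record(user, "read", doc_id)
        return self.originals[doc_id]

    def verify(self, doc_id: str) -> bool:
        # Re-hash the stored bytes to show they match what was received.
        return hashlib.sha256(self.originals[doc_id]).hexdigest() == doc_id


store = AuditedStore()
trade_xml = b"<trade id='T1'><notional ccy='EUR'>1000000</notional></trade>"
doc_id = store.ingest("alice", trade_xml)
original = store.read("bob", doc_id)
```

Because the stored bytes are never transformed, a report can always be re-derived from the source and checked against the hash.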

Looking to technology for a better answer!

Your favorite RDBMS

• ACID, consistent, and blazing fast if you buy Exadata

• Normalize your way out, and fail.

• Not fit for processing/reporting across different data objects: e.g. Trades and Mortgages

• Try to do NoSQL with SQL (innovative, but terribly slow and impossible to maintain)

Example of what not to do:

[Figure: two panels labeled SQL]
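The original slide images are not available, so the following is only an assumed illustration of "NoSQL with SQL": an entity-attribute-value (EAV) table, a common way to force schemaless data into a relational database. The table name `eav` and the sample rows are hypothetical. It uses Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One generic table for every kind of object: the schema says nothing useful.
conn.execute("CREATE TABLE eav (entity_id TEXT, attribute TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [
        ("T1", "type", "trade"),
        ("T1", "notional", "1000000"),
        ("T1", "currency", "EUR"),
        ("M1", "type", "mortgage"),
        ("M1", "principal", "250000"),
        ("M1", "currency", "EUR"),
    ],
)

# Rebuilding one logical record needs one self-join per attribute,
# and every value comes back as untyped text.
rows = conn.execute(
    """
    SELECT t.entity_id, n.value AS notional, c.value AS currency
    FROM eav AS t
    JOIN eav AS n ON n.entity_id = t.entity_id AND n.attribute = 'notional'
    JOIN eav AS c ON c.entity_id = t.entity_id AND c.attribute = 'currency'
    WHERE t.attribute = 'type' AND t.value = 'trade'
    """
).fetchall()
```

Every additional attribute adds another join, which is one plausible reading of "terribly slow and impossible to maintain".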

MongoDB

• Free! Open Source! GridFS!

• Have to transform data on ingest (to JSON) as most data is XML

• Eventual consistency (AKA data loss) means not real-time.

• Good at homogeneous data.

• Still master-slave, and scaling issues

• Brilliant for RAD / prototyping!
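To make the "transform on ingest" point concrete, here is a small sketch using only the Python standard library. The function `element_to_dict`, the `@` and `#text` naming conventions, and the sample trade are assumptions for illustration; they are not part of MongoDB or any specific converter.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Naive XML-to-dict conversion; attributes get an '@' prefix by convention."""
    node = {f"@{k}": v for k, v in elem.attrib.items()}
    for child in elem:
        # Child elements are always collected into a list per tag name.
        node.setdefault(child.tag, []).append(element_to_dict(child))
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    return node


source_xml = "<trade id='T1'><leg>pay</leg><leg>receive</leg></trade>"
root = ET.fromstring(source_xml)
as_json = json.dumps({root.tag: element_to_dict(root)}, sort_keys=True)
```

The conversion has to invent conventions that the XML never contained, so the JSON document is no longer the original record. For audit purposes the source XML would still have to be stored somewhere else.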

Where things go wrong:

Source: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

Cassandra (DataStax)

• Favors data duplication over normalization

• Very fast (if you duplicate well) but does not do JOINs

• Used by ING as main component of their Risk grid (YouTube)

• Excellent for time series data

Source: https://academy.datastax.com/resources/getting-started-time-series-data-modeling
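The time series pattern can be sketched without a cluster. The snippet below is plain Python that only mimics the idea of a partition key plus a clustering column; it does not use the Cassandra driver, and the instrument name `INSTR-1`, the day-sized bucket, and the function names are hypothetical.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime

# partition key -> rows kept sorted by the clustering key (the timestamp)
partitions = defaultdict(list)


def write_tick(instrument: str, ts: datetime, price: float) -> None:
    # Bucketing by day keeps any single partition from growing without bound.
    partition_key = (instrument, ts.strftime("%Y-%m-%d"))
    insort(partitions[partition_key], (ts, price))


def read_day(instrument: str, day: str) -> list:
    # A read touches exactly one partition; no join is needed.
    return partitions.get((instrument, day), [])


write_tick("INSTR-1", datetime(2016, 12, 1, 9, 0, 5), 70.10)
write_tick("INSTR-1", datetime(2016, 12, 1, 9, 0, 1), 70.05)
write_tick("INSTR-1", datetime(2016, 12, 2, 9, 0, 0), 70.40)

day_one = read_day("INSTR-1", "2016-12-01")
```

The data is laid out for the query that will be asked (all ticks for one instrument on one day), which is the trade-off behind "duplicate well" and "no JOINs".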

Hadoop

Source: http://hortonworks.com/products/data-center/hdp/

MarkLogic

• Focused on heterogeneously structured data

• Bitemporal, if you dare

• Semantics / RDF Triples

• ACID, Consistent, stores original file

• ABAC & redaction in enterprise version

• Rules, Workflows, Alerts, Triggers

• Not a COTS!
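Bitemporality matters for the obligation to re-report, so a conceptual sketch may help. This is not MarkLogic's temporal API; the `Version` class, the `as_of` function, and the sample notionals are hypothetical, and the model is simplified to a start date per axis with no end dates.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Version:
    value: int
    valid_from: date    # when the fact became true in the real world
    recorded_on: date   # when the system learned about it


history = []  # append-only: corrections add versions, nothing is overwritten


def record(value, valid_from, recorded_on):
    history.append(Version(value, valid_from, recorded_on))


def as_of(valid_on, known_on):
    """Return the value valid on `valid_on`, as the system knew it on `known_on`."""
    candidates = [
        v for v in history
        if v.valid_from <= valid_on and v.recorded_on <= known_on
    ]
    if not candidates:
        return None
    # Prefer the latest valid period, then the latest recording of it.
    return max(candidates, key=lambda v: (v.valid_from, v.recorded_on)).value


record(1_000_000, valid_from=date(2016, 11, 1), recorded_on=date(2016, 11, 1))
# A correction booked on 15 November, backdated to 1 November.
record(1_200_000, valid_from=date(2016, 11, 1), recorded_on=date(2016, 11, 15))

reported_then = as_of(date(2016, 11, 10), known_on=date(2016, 11, 10))
reported_now = as_of(date(2016, 11, 10), known_on=date(2016, 12, 1))
```

Here `reported_then` reproduces what was sent to the regulator at the time, while `reported_now` gives the corrected figure for the same business date.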

Ok, so now what?

Two approaches to a solution

Infra approach:

• Build everything yourself, use open source components

E.g.:

• Hadoop

• Cassandra + Kafka

Platform approach:

• Focus on application and business logic, not on infra

E.g.:

• MarkLogic

• Spark (without Hadoop)


Infra approach (SMACK example)

• Used (and designed) by Netflix, LinkedIn, Uber, Twitter

• Massive amounts of event processing (IoT)

• HA and Geo distributed

• Scala, Python, R, Java(Script)

• Asynchronous everywhere

• Near impossible to destroy: reactive, self-healing, back-pressure.
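Back-pressure is the least self-explanatory item in this list. The sketch below shows the idea with Python's `asyncio` and a bounded queue; it is an analogy only, not Akka or Kafka code, and the queue size and event count are arbitrary assumptions.

```python
import asyncio


async def producer(queue: asyncio.Queue, n: int, max_seen: list) -> None:
    for i in range(n):
        # put() suspends when the queue is full, slowing the producer down
        # to the pace of the consumer instead of exhausting memory.
        await queue.put(i)
        max_seen[0] = max(max_seen[0], queue.qsize())
    await queue.put(None)  # sentinel: no more events


async def consumer(queue: asyncio.Queue, out: list) -> None:
    while True:
        item = await queue.get()
        if item is None:
            break
        await asyncio.sleep(0)  # stand-in for real processing work
        out.append(item)


async def main():
    queue = asyncio.Queue(maxsize=5)
    processed, max_seen = [], [0]
    await asyncio.gather(producer(queue, 100, max_seen), consumer(queue, processed))
    return processed, max_seen[0]


processed, max_depth = asyncio.run(main())
```

All events arrive in order and the buffer never holds more than five items, however fast the producer tries to go.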

[Diagram labels: SMACK stack with Play REST APIs, Akka Actors, Kafka, Spark, Cassandra (three nodes), Marathon, Zookeeper, Mesos OS, Bare Metal (four hosts)]

Platform approach

[Diagram labels: Source Data (Qualitative, Quantitative); MarkLogic; “Insert Time Series Database here”; Spark; Happy Regulator. Columns: Data Flows, Data Stores, Analytics, Feedback Loop]

• Schema transformations

• Business Rules

• Workflow

• Rights management

Main take-aways

• There are no one-stop solutions

• Don’t pick bleeding edge stuff if you need it to work

• Focus on Business benefit of investment in Regulatory Compliance

• Separate the platform from the project!

• Start small, think big

Thank you for listening!

Alexander L. de Goeij

[email protected]

https://nl.linkedin.com/in/alexanderdegoeij

https://twitter.com/aldegoeij

https://angel.co/aldegoeij

References

• https://academy.datastax.com/resources/getting-started-time-series-data-modeling

• http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

• http://hortonworks.com/products/data-center/hdp/

• https://www.linkedin.com/pulse/data-hubs-marklogic-vs-hadoop-kurt-cagle

• https://engineering.linkedin.com/blog/2016/04/kafka-ecosystem-at-linkedin

• http://www.datanami.com/2015/10/05/how-uber-uses-spark-and-hadoop

• https://blog.twitter.com/2015/handling-five-billion-sessions-a-day-in-real-time

• http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflix.html
