java tech & tools | beyond the data grid: coherence, normalisation, joins and linear scalability...
DESCRIPTION
2011-11-02 | 02:25 PM - 03:15 PMIn 2009 RBS set out to build a single store of trade and risk data that all applications in the bank could access simultaniously. This talk discusses a number of novel techniques that were developed as part of this work. Based on Oracle Coherence the ODC departs from the trend set by most caching solutions by holding its data in a normalised form making it both memory efficient and easy to change. However it does this in a novel way that supports most arbitrary queries without the usual problems associated with distributed joins. We'll be discussing these patterns as well as others that allow linear scalability, fault tolerance and millisecond latencies.TRANSCRIPT
![Page 1: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/1.jpg)
ODCBeyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Ben Stopford : RBS
![Page 2: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/2.jpg)
![Page 3: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/3.jpg)
How do you support low latency and high throughput?
![Page 4: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/4.jpg)
Database Architecture is Aging
Most modern databases still follow a 1970s architecture (for example IBM’s System R)
![Page 5: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/5.jpg)
“Because RDBMSs can be beaten by more than an order of magnitude on the standard OLTP benchmark, then there is no market where they are competitive. As such, they should be considered as legacy technology more than a quarter of a century in age, for which a complete redesign and re-architecting is the appropriate next step.”
Michael Stonebraker (Creator of Ingres and Postgres)
![Page 6: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/6.jpg)
The lay of the land: The main architectural
constructs in the database industry
![Page 7: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/7.jpg)
The Traditional Architecture
• Data is on disk• Users have an allocated user
space, where intermediary results are calculated.
• The database brings data, normally via indexes, into memory and performs filters, joins, reordering and aggregation operations.
• The result is sent to the user.
![Page 8: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/8.jpg)
Improving Database Performance
Shared Disk Architecture
SharedDisk
• More ‘grunt’• Suffers from issues
arising from disk and lock contention.
![Page 9: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/9.jpg)
Improving Database Performance
Shared Nothing Architecture
• Massive storage potential
• Massive scalability of processing
• Commodity hardware• Limited by cross
partition joins
![Page 10: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/10.jpg)
Improving Database Performance (3)
In Memory Databases
![Page 11: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/11.jpg)
Memory is at least 100x faster than disk
0.000,000,000,000
μs ns psms
L1 Cache Ref
L2 Cache Ref
Main MemoryRef
1MB Main Memory
Cross Network Round Trip
Cross Continental Round Trip
1MB Disk/Network
* L1 ref is about 2 clock cycles or 0.7ns. This is the time it takes light to travel 20cm
![Page 12: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/12.jpg)
The architecture of an in memory database
• All data is at your fingertips.
• Query plans become less important as there is no IO
• Intermediary results are just pointers.
![Page 13: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/13.jpg)
This makes them very fast!!
![Page 14: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/14.jpg)
The proof is in the stats. TPC-H Benchmarks on a
1TB data set• Exasol: 4,253,937 QphH (In-Memory DB)• Oracle Database 11g (RAC): 1,166,976 QphH• SQL Server: 173,961QphH
![Page 15: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/15.jpg)
So why haven’t in memory databases
taken off?
![Page 16: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/16.jpg)
Address-Spaces are relatively small and of a
finite, fixed size
• Is there enough memory for your all your data?
• What happens when your data grows: The ‘One more bit problem’?
![Page 17: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/17.jpg)
Durability
What happens when you pull the plug?
![Page 18: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/18.jpg)
Improving Database Performance (4)
Distributed In Memory (Shared Nothing)
![Page 19: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/19.jpg)
Distribution Solves These Three Problems
• Solve the ‘one more bit’ problem by adding more hardware.
• Solve the durability problem with backups on another machine.
![Page 20: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/20.jpg)
Improving Database Performance (5) No SQL / Distributed
Caching
![Page 21: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/21.jpg)
Simplifying the Contract
Do you really need all of ACID?• Atomic• Consistent• Isolated• Durable
![Page 22: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/22.jpg)
Databases have huge operational overheads
Research with Shore DB indicates only 6.8% of
instructions contribute to ‘useful work’
Taken from “OLTP Through the Looking Glass, and What We Found There” Harizopoulos et al
![Page 23: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/23.jpg)
Avoid all that overhead
RAM means:• No IO• Single Threaded
No locking / latching
![Page 24: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/24.jpg)
Regular DatabaseOracle, Sybase,
MySql
Scale out
Distributed
In-Memory
Exasol, VoltDB, Hana
NoSQL Mongo, Cassandra
Shared Nothing
Vertica, Greenplum
b
Scale out & drop
ACID
Data Grid Coherence,
Teracotta
Scale out & drop
Disk & ACID
In-Memory Databas
e
TimesTen, HSQL,
KDBSingle
Address
Space
![Page 25: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/25.jpg)
So what are the themes that underpin this
progression?Distributed Architecture
Simplify the Contract
Stick to RAM
![Page 26: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/26.jpg)
ODC – Distributed, Shared Nothing, In Memory, Semi-
Normalised, Graph DB450 processes
Messaging (Topic Based) as a system of record
(persistence)
2TB of RAMOracle
Coherence
![Page 27: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/27.jpg)
The LayersD
ata
Layer Transactio
ns
Cashflows
Query
Layer
Mtms
Acc
ess
La
yer
Java client
API
Java client
API
Pers
iste
nce
Layer
![Page 28: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/28.jpg)
Three Tools of Distributed Data Architecture
Indexing
Replication
Partitioning
![Page 29: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/29.jpg)
How should we use these tools?
![Page 30: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/30.jpg)
Replication puts data everywhere
Wherever you go the data will be there
But your storage is limited by the memory on a node
![Page 31: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/31.jpg)
Partitioning scalesKeys Aa-Ap
Scalable storage, bandwidth and processing
Associating data in different partitions implies moving it.
![Page 32: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/32.jpg)
What about Denormalisation?
![Page 33: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/33.jpg)
Denormalisation implies the duplication of some
sub-entities
![Page 34: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/34.jpg)
…and that means managing consistency over
lots of copies
![Page 35: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/35.jpg)
…and all the duplication means you run out of space really quickly
![Page 36: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/36.jpg)
Spaces issues are exaggerated further when
data is versioned
Trade
Party
Trader Version 1
Trade
Party
Trader Version 2
Trade
Party
Trader Version 3
Trade
Party
Trader Version 4
…and you need versioning to do MVCC
![Page 37: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/37.jpg)
And reconstituting a previous time slice
becomes very diffi cult.Trad
ePart
yTrade
r
Trade
Trade
Party
Party
Party
Trader
Trader
![Page 38: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/38.jpg)
So we want to hold entities separately
(normalised) to alleviate concerns around consistency and space usage
![Page 39: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/39.jpg)
This means the object graph will be split across
multiple machines.
Trade
Party
Trader
Trade
Party
Trader
Independently Versioned
Data is Singleton
![Page 40: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/40.jpg)
Binding them back together involves a “distributed join” =>
Lots of network hops
Trade
Party
Trader
Trade
Party
Trader
![Page 41: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/41.jpg)
All these network hops make it slow
![Page 42: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/42.jpg)
Whereas the denormalised model the join is already
done
![Page 43: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/43.jpg)
Hence denormalisation is FAST!
(for reads)
![Page 44: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/44.jpg)
So what we want is the advantages of a normalised store at the speed of a denormalised one!
This is what using Snowflake Schemas and the Connected Replication pattern
is all about!
![Page 45: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/45.jpg)
Looking more closely: Why does normalisation mean we have to spread data around the cluster. Why
can’t we hold it all together?
![Page 46: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/46.jpg)
It’s all about the keys
![Page 47: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/47.jpg)
We can collocate data with common keys but if they crosscut the only way to
collocate is to replicate
Common Keys
Crosscutting
Keys
![Page 48: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/48.jpg)
We tackle this problem with a hybrid model:
Trade
PartyTrader
Partitioned
Replicated
![Page 49: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/49.jpg)
We adapt the concept of a Snowflake Schema.
![Page 50: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/50.jpg)
Taking the concept of Facts and Dimensions
![Page 51: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/51.jpg)
Everything starts from a Core Fact (Trades for us)
![Page 52: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/52.jpg)
Facts are Big, dimensions are small
![Page 53: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/53.jpg)
Facts have one key that relates them all (used to
partition)
![Page 54: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/54.jpg)
Dimensions have many keys
(which crosscut the partitioning key)
![Page 55: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/55.jpg)
Looking at the data:
Facts:=>Big, common keys
Dimensions=>Small,crosscutting Keys
![Page 56: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/56.jpg)
We remember we are a grid. We should avoid the
distributed join.
![Page 57: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/57.jpg)
… so we only want to ‘join’ data that is in the same
process
Trades
MTMs
Common Key
Use a Key Assignment
Policy (e.g.
KeyAssociation in Coherence)
![Page 58: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/58.jpg)
So we prescribe different physical storage for Facts
and Dimensions
Trade
PartyTrader
Partitioned
Replicated
![Page 59: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/59.jpg)
Facts are partitioned, dimensions are replicated
Data
La
yer
Transactions
Cashflows
Query
Layer
Mtms
Fact Storage(Partitioned)
Trade
PartyTrader
![Page 60: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/60.jpg)
Facts are partitioned, dimensions are replicated
Transactions
Cashflows
Dimensions(repliacte)
Mtms
Fact Storage(Partitioned)
Facts(distribute/ partition)
![Page 61: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/61.jpg)
The data volumes back this up as a sensible hypothesis
Facts:=>Big=>Distrib
ute
Dimensions=>Small => Replicate
![Page 62: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/62.jpg)
Key Point
We use a variant on a Snowflake Schema to
partition big entities that can be related via a partitioning key and
replicate small stuff who’s keys can’t map to our
partitioning key.
![Page 63: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/63.jpg)
Replicate
Distribute
![Page 64: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/64.jpg)
So how does they help us to run queries without
distributed joins?
This query involves:• Joins between Dimensions• Joins between Facts
Select Transaction, MTM, RefrenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
![Page 65: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/65.jpg)
What would this look like without this pattern?
Get Cost
Centers
Get LedgerBooks
Get SourceBooks
Get Transac-tions
Get MTMs
Get Legs
Get Cost
Centers
Network
Time
![Page 66: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/66.jpg)
But by balancing Replication and Partitioning we don’t need all
those hops
![Page 67: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/67.jpg)
Stage 1: Focus on the where clause:
Where Cost Centre = ‘CC1’
![Page 68: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/68.jpg)
Transactions
Cashflows
Mtms
Partitioned Storage
Stage 1: Get the right keys to query the Facts
Join Dimensions in Query Layer
Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
![Page 69: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/69.jpg)
Transactions
Cashflows
Mtms
Partitioned Storage
Stage 2: Cluster Join to get Facts
Join Dimensions in Query Layer
Join Facts across cluster
Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
![Page 70: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/70.jpg)
Stage 2: Join the facts together effi ciently as we know they are
collocated
![Page 71: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/71.jpg)
Transactions
Cashflows
Mtms
Partitioned Storage
Stage 3: Augment raw Facts with relevant
Dimensions
Join Dimensions in Query Layer
Join Facts across cluster
Join Dimensions in Query Layer
Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
![Page 72: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/72.jpg)
Stage 3: Bind relevant dimensions to the result
![Page 73: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/73.jpg)
Bringing it together:
Java client
API
Replicated Dimensions
Partitioned Facts
We never have to do a distributed join!
![Page 74: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/74.jpg)
So all the big stuff is held partitioned
And we can join without shipping keys around and
having intermediate
results
![Page 75: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/75.jpg)
We get to do this…
Trade
Party
Trader
Trade
Party
Trader
![Page 76: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/76.jpg)
…and this…
Trade
Party
Trader Version 1
Trade
Party
Trader Version 2
Trade
Party
Trader Version 3
Trade
Party
Trader Version 4
![Page 77: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/77.jpg)
..and this..
Trade
Party
Trader
Trade
Trade
Party
Party
Party
Trader
Trader
![Page 78: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/78.jpg)
…without the problems of this…
![Page 79: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/79.jpg)
…or this..
![Page 80: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/80.jpg)
..all at the speed of this… well almost!
![Page 81: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/81.jpg)
![Page 82: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/82.jpg)
But there is a fly in the ointment…
![Page 83: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/83.jpg)
I lied earlier. These aren’t all Facts.
Facts
Dimensions
This is a dimension• It has a different
key to the Facts.• And it’s BIG
![Page 84: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/84.jpg)
We can’t replicate really big stuff… we’ll run out of space => Big Dimensions are a problem.
![Page 85: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/85.jpg)
Fortunately there is a simple solution!
The Connected Replication
Pattern
![Page 86: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/86.jpg)
Whilst there are lots of these big dimensions, a large majority are never used. They are not all “connected”.
![Page 87: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/87.jpg)
If there are no Trades for Goldmans in the data store then a Trade Query will never need the Goldmans Counterparty
![Page 88: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/88.jpg)
Looking at the Dimension data some are quite large
![Page 89: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/89.jpg)
But Connected Dimension Data is tiny by comparison
![Page 90: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/90.jpg)
One recent independent study from the database community showed that 80% of data remains unused
![Page 91: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/91.jpg)
So we only replicate
‘Connected’ or ‘Used’ dimensions
![Page 92: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/92.jpg)
As data is written to the data store we keep our ‘Connected Caches’ up
to dateD
ata
Layer
Dimension Caches
(Replicated)
Transactions
Cashflows
Pro
cessin
g
Layer
Mtms
Fact Storage(Partitioned)
As new Facts are added relevant Dimensions that they reference are moved to processing layer caches
![Page 93: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/93.jpg)
The Replicated Layer is updated by recursing through the arcs on the domain model when facts change
![Page 94: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/94.jpg)
Saving a trade causes all it’s 1st level references to be triggered
Trade
Party
Alias
Source
Book
Ccy
Data Layer(All Normalised)
Query Layer(With connected dimension Caches)
Save Trade
Partitioned Cache
Cache Store
Trigger
![Page 95: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/95.jpg)
This updates the connected caches
Trade
Party
Alias
Source
Book
Ccy
Data Layer(All Normalised)
Query Layer(With connected dimension Caches)
![Page 96: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/96.jpg)
The process recurses through the object graph
Trade
Party
Alias
Source
Book
Ccy
Party
LedgerBook
Data Layer(All Normalised)
Query Layer(With connected dimension Caches)
![Page 97: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/97.jpg)
‘Connected Replication’A simple pattern which recurses through the foreign keys in the
domain model, ensuring only ‘Connected’ dimensions are
replicated
![Page 98: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/98.jpg)
With ‘Connected Replication’ only 1/10th of the data
needs to be replicated (on
average).
![Page 99: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/99.jpg)
Limitations of this approach
•Data set size. Size of connected dimensions limits scalability.
• Joins are only supported between “Facts” that can share a partitioning key (But any dimension join can be supported)
![Page 100: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/100.jpg)
Everything is Java
Java client
APIJava schema Java ‘Stored
Procedures’ and
‘Triggers’
![Page 101: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/101.jpg)
API – Queries utilise a fluent interface
![Page 102: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/102.jpg)
Performance
Query with more than
twenty joins conditions:
2GB per min / 250Mb/s
(per client)3ms latency
![Page 103: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/103.jpg)
Conclusion
Data warehousing, OLTP and Distributed caching fields are all converging on in-memory architectures to get away from disk induced latencies.
![Page 104: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/104.jpg)
Conclusion
Shared-Nothing architectures are always subject to the distributed join problem if they are to retain a degree of normalisation.
![Page 105: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/105.jpg)
Conclusion
We present a novel mechanism for avoiding the distributed join problem by using a Star Schema to define whether data should be replicated or partitioned.
Partitioned Storage
![Page 106: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/106.jpg)
Conclusion
We make the pattern applicable to ‘real’ in-memory data models by only replicating objects that are actually used: the Connected Replication pattern.
![Page 107: Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability | Ben Stopford](https://reader035.vdocuments.site/reader035/viewer/2022062703/5555cd9cd8b42aaf158b4b7f/html5/thumbnails/107.jpg)
The End• Further details online
http://www.benstopford.com (there are textual and slideshare versions of this talk)
• A big thanks to the team in both India and the UK who built this thing.
• Questions?