aerospike hybrid memory architecture
TRANSCRIPT
High PerformanceNoSQL Database
Powering New Opportunities at Scale
Ask The Experts: Architectural Overview
Overview
■ Overview ■ Database landscape ■ Use cases
■ Architecture ■ Storage ■ Indexes and Operations ■ Cross Datacenter Replication ■ Questions
Response time: Hours, Weeks TB to PB Read Intensive
TRANSACTIONS (OLTP)
Response time: Seconds Gigabytes of data
Balanced Reads/Writes
ANALYTICS (OLAP)
STRUCTURED DATA
Response time: Seconds Terabytes of data
Read Intensive
BIG DATA ANALYTICS
Real-time Transactions Response time: < 5 ms 1-100 TB Balanced Reads/Writes 24x7x365 Availability
UNSTRUCTURED DATA
REAL-TIME BIG DATA
Database Landscape
Next Generation Systems of Engagement – An Emerging Market with Multiple Technologies
Aerospike Delivers Predictable Performance, Highest Availability, and Lowest TCO
Systems of Engagement - TCO
TCO ($)
Scale TB
Systems of Engagement – Many Choices
Alternative TCO
Aerospike TCO
Speed TPS
Scale TB
Significant functional overlap - Commodity DB problem set
Unique Functional Capabilities and High Value Problem Set
High Performance NoSQL +
■ Unlimited Key Value pairs, record size up to 128KB - 1MB.
■ Complex & Scalar Types - integer, double, string, blob, list, map, geospatial.
■ Distributed Queries on secondary
indices (exact match, integer range, geospatial queries).
■ User Defined Functions extend the database.
■ Patented Indexed Map-Reduce – distributed queries can be filtered, transformed, aggregated, and reduced.
Use cases
6
MILLIONS OF CONSUMERS BILLIONS OF DEVICES
APP SERVERS
DATA WAREHOUSE INSIGHTS
Advertising Technology Stack
WRITE CONTEXT
In-memory NoSQL
WRITE REAL-TIME CONTEXT READ RECENT CONTENT PROFILE STORE Cookies, email, deviceID, IP address, location, segments, clicks, likes, tweets, search terms... REAL-TIME ANALYTICS Best sellers, top scores, trending tweets
BATCH ANALYTICS Discover patterns, segment data: location patterns, audience affinity
Currently about 3.0M / sec in North American
Challenge • Billions of users & cookies across the internet • Accessible using provisioning applications
(self-serve and through support personnel) • Real-time algorithms used for targeting, offers.
Need for Extremely High Availability, Reliably, Low latency
• 10’s TBs of data • 1B ~ 10B objects • 1M ~ 10M TPS
Selected NoSQL • Clustered HA system • Predictable low latency at high throughput • Highly-available and reliable on failure • Cross data center (XDR) support
AdTech – Targeting, Bidding, Programmatic
INTERNETAD EXCHANGE
BIDDINGAPPLICATION
SEARCHES
VISITSTIME ON PAGE
AUDIENCE
HISTORICALDATA
BEHAVIOR MODELSMACHINELEARNING
Travel Portal
PRICING DATABASE (RATE LIMITED)
Poll for Pricing Changes
PRICING DATA
Store Latest Price
SESSION MANAGEMENT
Session Data
Read Price
XDR
Airlines forced interstate banking Legacy mainframe technology Multi-company reservation and pricing Requirement: 1M TPS allowing overhead
Travel App
Financial Services – Intraday Positions
10M+ user records Primary key access 1M+ TPS
• Challenge – DB2 stores positions for 10 Million
customers – Value-at-risk calculations in minutes,
not hours – Consistent view of trade state across all
applications – Must update stock prices, show balances
on 300 positions, process 250M transactions, 2 M updates/day
– Cache uneconomical – 150 servers growing to 1000
• Need to scale reliably – 3 à 13 TB – 100 à 400 Million objects – 200k à I Million TPS
• Selected NoSQL – Flash – Predictable Low latency at High
Throughput – Immediate consistency – Cross data center (XDR) support – 10 Server Cluster
IBM DB2 (MAINFRAME)
Read/Write
Start of Day Data Loading
End of Day Reconciliation
Query REAL-TIME DATA FEED
ACCOUNT POSITIONS
XDR
QOS & Real-Time Billing for Telcos
Challenge • Per-account routing rules win edge systems • Traffic shaping to implement account policies • Accessible using provisioning applications
(self-serve and through support personnel) Need for Extremely High Availability, Reliably, Low latency
• TBs of data • 10-100M objects • 10-200K TPS
Selected NoSQL • Clustered system • Predictable low latency at high throughput • Highly-available and reliable on failure • Cross data center (XDR) support
SOURCE DEVICE/USER DESTINATION Real-Time
Auth. QoS Billing
Request Execute Request
Real-Time Checks Config Module App
Update Device User Setting
Hot-Standby
XDR
Traditional SOE Architecture Has Significant Limitations
Challenges: • Complex • Maintainability • Durability • Consistency • Scalability • Cost ($) • Data Lag Caching Layer
Operational Database
Legacy RDBMS HDFS BASED
Fast speed – Consumer Scale
Real-time Consumer Facing
Pricing / Inventory/Billing
Real-time Decisioning
Streaming Data
Legacy Database (Mainframe)
RDBMS Database
Transactional Systems
Enterprise Environment
XDR
Aerospike
Hybrid Memory Systems - Enabling a New Class of Real-time Applications
Aerospike Delivers Predictable Performance, Highest Availability, and Lowest TCO
Legacy Database (Mainframe)
RDBMS Database
Transactional Systems
Enterprise Environment
Powered by High Performance NoSQL
Fast speed – Consumer Scale
Hybrid Memory Database
Benefits: • Simplicity • Maintainability • Durability • Consistency • Scalability • Cost ($) • Data Lag Reduced
Real-time Consumer Facing
Pricing / Inventory/Billing
Real-time Decisioning
Streaming Data
Legacy RDBMS HDFS BASED
Architecture
Architecture – The Big Picture
1) No Hotspots – Distributed Hash Table simplifies data partitioning
2) Smart Client – 1 hop to data, no load balancers
3) Shared Nothing Architecture, every node is identical
4) Smart Cluster, Zero Touch
– auto-failover, rebalancing, rack aware, rolling upgrades
5) Transactions and long-running tasks prioritized in real-time
6) XDR – sync replication across data centers ensures Zero Downtime
How Data is Organized
Aerospike RDBMS
Namespace Tablespace or Database
Set Table
Record Row
Bin Column
Bin type
Integer
Double
String
BLOB
List
Map / SortedMap
GeoJSON
Smart Client™
■ The Aerospike Client is implemented as a library, JAR or DLL, and consists of 2 parts: ■ Operation APIs – These are the operations that you can execute on the
cluster – CRUD+ etc. ■ First class observer of the Cluster – Monitoring the state of each node and
aware on new nodes or node failures.
Smart Client - Distributed Hash table
■ Distributed Hash Table with No Hotspots ■ Every key hashed with RIPEMD160
into an ultra efficient 20 byte (fixed length) string ■ Hash + additional (fixed 64 bytes) data
forms index entry in RAM ■ Some bits from hash value are used to
calculate the Partition ID (4096 partitions) ■ Partition ID maps to Node ID in the cluster
■ 1 Hop to data ■ Smart Client simply calculates Partition
ID to determine Node ID ■ No Load Balancers required
Even record distribution
Node A Node B Node C
Z
Z’
Y
Y’
X
X’
AerospikeClient
Application
Automatic rebalancing
Adding, or removing a node, the cluster automatically rebalances 1. Cluster discovers new node via gossip
protocol 2. Paxos vote determines new data
organization 3. Partition migrations scheduled 4. When a partition migration starts,
write journal starts on destination 5. Partition moves atomically 6. Journal is applied and source data deleted After migration is complete, the cluster is evenly balanced.
Predictable high performance
Data is distributed evenly across nodes in a cluster using the Aerospike Smart Partitions™ algorithm. ■ RIPEMD160 (no collisions yet found) ■ 4096 Data Partitions ■ Even distribution of
■ Partitions across nodes ■ Records across Partitions ■ Data across Flash devices
■ Primary and Replica Partitions
Even Data Distribution
Massively Parallel
Automatic Distribution of Data • Even amount of data on all nodes and all drives • All hardware used equally • Load on all servers is balanced • No “hot spots” • No configuration changes as workload or use
case changes
Smart Clients • Single “hop” from client to server • Cluster-spanning operations (scan, query, batch)
sent to all processing nodes for parallel processing.
Scale up Architecture - Server internals TC
P/IP
Soc
ket
Flash Storage
Service Threads
Service Queues
Transaction Threads
Predictable Performance
DIGEST & TREE INFO
RECORD METADATA
STORAGE POINTER
Reads
Single hop DRAM Read
OWNING SERVER PRIMARY INDEX STORAGE
DIGEST & TREE INFO
RECORD METADATA
STORAGE POINTER
Writes
Single hop DRAM Write
OWNING SERVER PRIMARY INDEX MEMORY BUFFER
Flush
ASYNC STORAGE
DIGEST & TREE INFO
RECORD METADATA
STORAGE POINTER
DRAM Write
REPLICA SERVER PRIMARY INDEX MEMORY BUFFER
Flush
ASYNC STORAGE
Synchronous Replica Write, Single hop
Predictable Performance
Performance Built In • Written in C with memory-optimized libraries => No garbage collection • Continual defragmentation of storage => No compactions • Known master for any piece of data => No quorum reads • Designed as a distributed database => Networking primary consideration
Storage Optimizations • Writes done to memory buffer => Avoid storage slowdown • Storage used in “block” mode => No file system overhead • Reads and writes striped across devices => Concurrent use of hardware
Smart Clients • Single “hop” from client to server
Data Consistency
• Written data should be immediately consistent within a cluster without introducing additional latency
• Mixed workloads (true concurrent reads/writes) should not cause issues
• Written data should be asynchronously written to remote clusters
Data Consistency
OWNING SERVER REPLICA SERVER
Local Cluster
Remote Cluster
ASYNC REPLICATION
SYNCHRONOUS REPLICATION
XD
R
WRITE
READ
Storage
Data Storage Layer – Hybrid Architecture
Data in RAM
Data in RAM is very fast – at a price ■ Indexes and Data both in-memory ■ $$$ (great < 100G, Cloud) ■ More servers ■ Super fast ■ Optional HDD as backing store
Data on Flash / SSD
■ Record data stored contiguously ■ 1 read per record
■ Automatic continuous defragment ■ Data written in flash optimal blocks ■ Automatic distribution across drives ■ Writes buffered
BLOCK INTERFACE
SSD SSD SSD
AEROSPIKE
HYBRID MEMORY SYSTEM™
10x LOWER TCO10x Fewer
Indexes and Operations
Indexes in DRAM, Data on SSD
• Small amount of DRAM => avoid cost and server sprawl
• No concept of cache misses => Predictable, low latency performance on NVMe/SSD
Primary Index
Primary index ■ DHT of rbTrees (one per partition)
■ Index entry ■ 64 bytes ■ Write generation ■ Time To Live ■ Last Update Time ■ Storage address ■ Uses shared memory for
Fast Restart
Key Value operations using the Primary Index
■ Put ■ Exists ■ Get ■ CAS ■ Increment (counters) ■ Append/Prepend ■ List Operations ■ SortedMap Operations ■ Touch ■ Delete ■ Batch Read/Exists ■ Scan
Secondary Indexes
■ Bin (Column) indices ■ Declarative index
■ String, Integer, List, Map Keys, ■ Map Values, GeoJSON
■ In RAM – fast ■ Multi-node
■ Co-located with primary index ■ Reference local data only ■ Index creation
■ Tools: AQL, ascli ■ Client API – developer only
Queries on Secondary Indexes
A query is a value based lookup using a secondary index similar to a SQL select statement. The query is sent to all nodes in the cluster in parallel ■ Scatter-gather ■ Multi-threaded
Best for “low selectivity” indices Good for “high selectivity” indices
Selectivity =Cardinality / Rows*100
SECONDARY INDEX
PRIMARY INDEX
UDF UDF UDF
RECORD RECORD RECORD RECORD
SSD
SSD
DRAM
… … …
Cross Datacenter Replication - XDR
XDR Architecture
Each node in the cluster Distributed clusters
XDR Topologies
Star Replication
Simple Active-Passive Simple Active-Active
More Complex Topology
Failure Handling
Node failure within a cluster – nodes with replica data will continue Link failure – XDR keeps track of link failures and data to be shipped over that link. It will recover when the link comes up.
Node failure in a Cluster Link failure between Clusters
Aerospike – Enabling Your Digital Transformation
Powered by High Performance NoSQL
Aerospike – The Next Generation Operational Database
TRUE HYBRID MEMORY ARCHITECTURE• No cache required – simpler architecture! Smaller Server Footprint• Patented Flash Optimization – Log structured File System• Record Oriented, Schema Free NoSQL KV Store
PREDICTABLE PERFORMANCE• True Real Time DB engine, multi threaded, massively parallel• DRAM or Hybrid DRAM/Flash for Persistence• Stable, Low Latency and high throughput under any condition• Deployable on Bare Metal, virtualized, containerized, or Cloud
DYNAMIC CLUSTER MANAGEMENT• Highest Uptime & Availability (5 nines plus), Scalable • Automatic DB Cluster formation, self healing and dynamic sharding • Cross Data Center Replication (XDR)
INTELLIGENT CLIENTS• Machine Learning• Broad language support (C/C++, Java,C#, Python, Go, Node.js, PHP)• Patented functionality, DB aware Clients, No load balancers required• Rich API’s - Accelerated development
TCO• Optimized for Flash and DRAM• Demonstrated 10:1 price performance savings• Up to 10x reduction in servers deployed• Huge operational efficiency – “Set it and Forget it”
$
High PerformanceNoSQL Database
Powering New Opportunities at Scale
@aerospikedb
NEXT STEPS: See how much you can save with Aerospike: http://www.aerospike.com/tco-calculator/ Ready to get started? http://www.aerospike.com/quick-start/ If you have any questions or want to further explore if Aerospike is right for you, contact us: [email protected]