building a time series database


Upload: james-blackburn

Post on 06-Jan-2017


TRANSCRIPT

Page 1: Building a Time Series Database

© Man 2015

Building a time series database… 10^12 rows and counting

James Blackburn

@jimmybb

@ManAHLTech

Page 2: Building a Time Series Database

2

Agenda

Data Storage At AHL
1. AHL
2. Data: size and shape
3. Implementation: Arctic
4. Performance
5. Conclusion

Page 3: Building a Time Series Database

3

AHL Systematic Fund Management

Page 4: Building a Time Series Database

4

AHL Systematic Fund Management

Page 5: Building a Time Series Database

5

AHL Systematic Fund Management

Page 6: Building a Time Series Database

6

AHL Systematic Fund Management

Page 7: Building a Time Series Database

Quant researchers
• Interactive work – latency sensitive
• Batch jobs run on a cluster – maximize throughput
• Historical data
• New data
• ... want control of storing their own data

Trading system
• Auditable – SVN for data
• Stable
• Performant

7

Overview – Data consumers

Page 8: Building a Time Series Database

8

AHL’s Data Pipeline

Page 9: Building a Time Series Database

…
[2-2092492678] FFIM6.NaE 2015-11-11 13:38:18.330 4 UPDATE TIMACT: '13:38', BID: 6192.0, QUOTIM: '13:38:18', QUOTIM_MS: 49098000.0, BIDSIZE: 1.0, EXCHTIM: '13:38:18'
[2-2092492759] FFIM6.NaE 2015-11-11 13:38:18.676 16 UPDATE TIMACT: '13:38', ASKSIZE: 1.0, QUOTIM: '13:38:18', QUOTIM_MS: 49098000.0, ASK: 6200.5, EXCHTIM: '13:38:18'
[2-2092493019] FFIM6.NaE 2015-11-11 13:38:20.333 14 UPDATE TIMACT: '13:38', BID: 6192.5, QUOTIM: '13:38:20', QUOTIM_MS: 49100000.0, BIDSIZE: 1.0, EXCHTIM: '13:38:20'
[2-2092493079] FFIM6.NaE 2015-11-11 13:38:20.685 2 UPDATE TIMACT: '13:38', ASKSIZE: 1.0, QUOTIM: '13:38:20', QUOTIM_MS: 49100000.0, ASK: 6201.0, EXCHTIM: '13:38:20'
…

9

Tick Data
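
Each message above carries only the fields that changed (a bid update has BID and BIDSIZE but no ASK). A minimal sketch of parsing one such line into a Python dictionary follows. The line layout is inferred from the four sample messages only, the function name `parse_tick` is hypothetical, and the meaning of the integer before UPDATE is not stated on the slide, so it is kept as an opaque `number`.

```python
import ast
import re
from datetime import datetime

# Layout inferred from the sample lines (an assumption, not a documented format):
# [sequence] symbol date time number KIND KEY: value, KEY: value, ...
HEADER = re.compile(
    r"^\[(?P<seq>[^\]]+)\]\s+(?P<symbol>\S+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<number>\d+)\s+(?P<kind>[A-Z]+)\s+(?P<fields>.*)$"
)


def parse_tick(line):
    """Parse one tick message line into a dict (hypothetical helper)."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError("unrecognised tick line: %r" % line)
    fields = {}
    for item in match.group("fields").split(", "):
        key, _, value = item.partition(": ")
        # Values are either quoted strings or floats, so literal_eval handles both.
        fields[key] = ast.literal_eval(value)
    return {
        "sequence": match.group("seq"),
        "symbol": match.group("symbol"),
        "timestamp": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
        "number": int(match.group("number")),  # meaning not given on the slide
        "kind": match.group("kind"),
        "fields": fields,
    }


sample = (
    "[2-2092492678] FFIM6.NaE 2015-11-11 13:38:18.330 4 UPDATE "
    "TIMACT: '13:38', BID: 6192.0, QUOTIM: '13:38:18', "
    "QUOTIM_MS: 49098000.0, BIDSIZE: 1.0, EXCHTIM: '13:38:18'"
)
tick = parse_tick(sample)
print(tick["symbol"], tick["timestamp"], sorted(tick["fields"]))
```

Because every message is sparse, a column-oriented store needs some way to record which rows actually carry a value for each column.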

Page 10: Building a Time Series Database

10

TimeSeries – single stock

Page 11: Building a Time Series Database

Data sizes we deal with…
• ~1MB x 1000s: 1x a day price data, 10k rows (30 years)
• ~0.5GB x 1000s: 1-minute data, 4M rows (20 years)
• ~1GB x 1000s: 10k x 10k data matrices, 100M cells (30 years)
• ~30TB: Tick data, 100k msgs/s, 1.2B msgs/day

… and different shapes
• Time series of prices
• Event data
• News data
• Metadata
• What’s next?

11

Data sizes

Page 12: Building a Time Series Database

12

lib.read('US Equity Adjusted Prices')
Out[4]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
Columns: 8103 entries, AST10000 to AST9997
dtypes: float64(8631)

Problems - Scale

Page 13: Building a Time Series Database

13

lib.read('US Equity Adjusted Prices')
Out[4]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
Columns: 8103 entries, AST10000 to AST9997
dtypes: float64(8631)

Equity Prices: 77M float64s, 600MB of data ~= 5Gbits!

Problems - Scale
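
The 77M / 600MB / 5Gbit figures follow directly from the frame's shape. As a quick check, using the 9605 index entries and 8103 columns printed above and 8 bytes per float64:

```python
rows = 9605           # DatetimeIndex entries reported above
columns = 8103        # column entries reported above
bytes_per_value = 8   # one float64

cells = rows * columns
size_bytes = cells * bytes_per_value
size_bits = size_bytes * 8

print("cells: %.1fM" % (cells / 1e6))
print("size:  %.0f MB" % (size_bytes / 1e6))
print("wire:  %.2f Gbit" % (size_bits / 1e9))
```

So even a single read of this one frame moves roughly 5 Gbit before any compression.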

Page 14: Building a Time Series Database

14

SQL

cx_Oracle

~ 200us per Row.

10k rows => 2.2s

Page 15: Building a Time Series Database

15

SQL

cx_Oracle

~ 200us per Row.

10k rows => 2.2s

How do we query 4 million rows?
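
A rough extrapolation shows why this question matters. The sketch below assumes the per-row cost stays constant as the result set grows, which is an assumption and not something measured on the slide; it uses the 2.2s for 10k rows quoted above and the 4M rows of 1-minute data.

```python
measured_seconds = 2.2    # 10k rows via cx_Oracle, from the slide
measured_rows = 10_000
target_rows = 4_000_000   # roughly 20 years of 1-minute data

# Assumption: cost scales linearly with the number of rows fetched.
seconds_per_row = measured_seconds / measured_rows
estimated_seconds = seconds_per_row * target_rows

print("per row:  %.0f us" % (seconds_per_row * 1e6))
print("estimate: %.0f s (about %.1f minutes)" % (estimated_seconds, estimated_seconds / 60))
```

Under that assumption a single 1-minute series would take on the order of a quarter of an hour to load row by row.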

Page 16: Building a Time Series Database

Many different existing data stores
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches

16

Overview – Databases

Page 17: Building a Time Series Database

Many different existing data stores
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches

17

Can we build one system to rule them all?

Overview – Databases

Page 18: Building a Time Series Database

18

Implementation

https://github.com/manahl/arctic/

Page 19: Building a Time Series Database

Requirements

• Easy to use – and we mean easy

• Fast – as fast as local files

• Scalable – unbounded in data-size and number of clients

• Agile – any data shape; new shapes; iterative development

• Complete – all data behind the simple API

19

Project Requirements

Page 20: Building a Time Series Database

Goals
• 20 years of 1 minute data in <1s
• 200 instruments x all history x once a day data <1s

• Single data store for all data types
• 1x day data → Tick data

• Data versioning + Audit

20

Project Goals

Page 21: Building a Time Series Database

Data bucketed into named Libraries
• One minute
• Daily
• User-data: jbloggs.EOD
• Metadata Index

Pluggable Library types:
• VersionStore
• TickStore
• Pickle Store
• … pluggable …

https://github.com/manahl/arctic/blob/master/howtos/how_to_custom_arctic_library.py

21

Arctic Libraries
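
The idea of named libraries backed by pluggable library types can be sketched with a small in-memory registry. This is a toy illustration only and does not use Arctic or MongoDB; every name in it (`LIBRARY_TYPES`, `register_toy_type`, `ToyPickleStore`, `ToyStore`) is hypothetical. The howto linked above shows the real mechanism.

```python
import pickle

# Toy registry mapping a type name to the class implementing it (hypothetical).
LIBRARY_TYPES = {}


def register_toy_type(name, cls):
    """Make a library implementation available under a type name."""
    if name in LIBRARY_TYPES:
        raise ValueError("library type %r already registered" % name)
    LIBRARY_TYPES[name] = cls


class ToyPickleStore:
    """Stores any picklable object per symbol, in memory."""

    def __init__(self):
        self._data = {}

    def write(self, symbol, data):
        self._data[symbol] = pickle.dumps(data)

    def read(self, symbol):
        return pickle.loads(self._data[symbol])

    def list_symbols(self):
        return sorted(self._data)


class ToyStore:
    """Holds named libraries, each created from a registered type."""

    def __init__(self):
        self._libraries = {}

    def initialize_library(self, name, lib_type):
        self._libraries[name] = LIBRARY_TYPES[lib_type]()

    def list_libraries(self):
        return sorted(self._libraries)

    def __getitem__(self, name):
        return self._libraries[name]


register_toy_type("ToyPickleStore", ToyPickleStore)

store = ToyStore()
store.initialize_library("jbloggs.EOD", "ToyPickleStore")
store["jbloggs.EOD"].write("SYMBOL", {"close": [1.0, 2.0, 3.0]})
print(store.list_libraries(), store["jbloggs.EOD"].list_symbols())
```

The point of the design is that callers only see the library name; which storage strategy sits behind it is decided once, when the library is initialised.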

Page 22: Building a Time Series Database

22

Page 23: Building a Time Series Database

23

Page 24: Building a Time Series Database

24

Page 25: Building a Time Series Database

Document ~= Python Dictionary / Java HashMap

Flexible schema Rapid prototyping

OpenSource database

Great support

#1 NoSQL DB (#4 overall) http://db-engines.com/en/ranking

25

Why MongoDB

Page 26: Building a Time Series Database

Arctic key-value store

26

from arctic import Arctic

a = Arctic('research') # Connect to the data store

a.list_libraries() # What data libraries are available
library = a['jbloggs.EOD'] # Get a Library
library.list_symbols() # List symbols

library.write('SYMBOL', <TS or other data>) # Write
library.read('SYMBOL', version=…) # Read, with an optional version

library.snapshot('snapshot-name') # Create a named snapshot of the library
library.list_snapshots() # List the snapshots of the library

https://github.com/manahl/arctic/blob/master/howtos/how_to_use_arctic.py

Arctic API

Page 27: Building a Time Series Database

27

Page 28: Building a Time Series Database

28

Arctic - TickStore

Arctic('localhost').initialize_library('tickdb', 'TickStoreV3')

Page 29: Building a Time Series Database

29

Implementation – slicing a chunk

Page 30: Building a Time Series Database

30

Implementation – a chunk

{
  ID: ObjectId('52b1d39eed5066ab5e87a56d'),
  SYMBOL: 'symbol',
  INDEX: Binary('…, datetime, …', 0),
  COLUMNS: {
    ASK: {
      DATA: Binary('<compressed>', 0),
      DTYPE: '<f8',
      ROWMASK: Binary('...', 0)
    },
    ...
  },
  START: DateTime(2015-01-01),
  END: DateTime(2015-11-12),
  SEGMENT: 1386933906826L,
  SHA: 1386933906826L,
  VERSION: 3,
}
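
To make the DATA / DTYPE / ROWMASK trio concrete, here is a minimal sketch of packing one sparse column into that shape and unpacking it again. It is not Arctic's code: it uses `zlib` from the standard library in place of LZ4, the bit order of the mask is an assumption, and the helper names `pack_column` and `unpack_column` are hypothetical.

```python
import struct
import zlib


def pack_column(values):
    """Pack a list of float-or-None (one entry per row) into a column dict."""
    # ROWMASK: one bit per row, set when the row has a value for this column.
    # Least-significant-bit-first ordering is an assumption of this sketch.
    mask = bytearray((len(values) + 7) // 8)
    for row, value in enumerate(values):
        if value is not None:
            mask[row // 8] |= 1 << (row % 8)
    # DATA: only the values that are present, as little-endian float64 ('<f8').
    dense = [v for v in values if v is not None]
    raw = struct.pack("<%dd" % len(dense), *dense)
    return {
        "DATA": zlib.compress(raw),          # zlib stands in for LZ4 here
        "DTYPE": "<f8",
        "ROWMASK": zlib.compress(bytes(mask)),
    }


def unpack_column(column, n_rows):
    """Rebuild the per-row list, with None where the row had no value."""
    raw = zlib.decompress(column["DATA"])
    dense = iter(struct.unpack("<%dd" % (len(raw) // 8), raw))
    mask = zlib.decompress(column["ROWMASK"])
    return [
        next(dense) if (mask[row // 8] >> (row % 8)) & 1 else None
        for row in range(n_rows)
    ]


# ASK appears on every other message in the tick sample, so half the rows are empty.
ask = [None, 6200.5, None, 6201.0]
packed = pack_column(ask)
restored = unpack_column(packed, len(ask))
print(packed["DTYPE"], restored)
```

Storing each column densely with a separate row mask keeps sparse tick fields compact while still letting a reader realign every column to the shared INDEX.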

Page 31: Building a Time Series Database

31

Implementation – TickStore

Sym1

Sym2

Page 32: Building a Time Series Database

32

Page 33: Building a Time Series Database

33

Arctic - VersionStore

Arctic('localhost').initialize_library('library')

Page 34: Building a Time Series Database

34

Implementation – VersionStore

Snap A

Snap B

Sym1, v1

Sym1, v3

Sym2, v4

Sym2, v5
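
The relationship between symbols, versions and snapshots can be illustrated with a toy in-memory store. This is a sketch of the concept only, not how VersionStore is implemented: the class `ToyVersionStore` is hypothetical, versions are numbered per symbol for simplicity, and nothing is persisted.

```python
class ToyVersionStore:
    """In-memory illustration: every write adds a version, snapshots pin versions."""

    def __init__(self):
        self._versions = {}    # symbol -> list of data, index 0 is version 1
        self._snapshots = {}   # snapshot name -> {symbol: version number}

    def write(self, symbol, data):
        history = self._versions.setdefault(symbol, [])
        history.append(data)   # earlier versions are never overwritten
        return len(history)    # the new version number

    def read(self, symbol, version=None, snapshot=None):
        if snapshot is not None:
            version = self._snapshots[snapshot][symbol]
        history = self._versions[symbol]
        if version is None:
            version = len(history)   # default to the latest version
        return history[version - 1]

    def snapshot(self, name):
        if name in self._snapshots:
            raise ValueError("snapshot %r already exists" % name)
        # Record the current version of every symbol under this name.
        self._snapshots[name] = {s: len(h) for s, h in self._versions.items()}

    def list_snapshots(self):
        return sorted(self._snapshots)


vstore = ToyVersionStore()
vstore.write("Sym1", [1.0, 2.0])
vstore.snapshot("Snap A")
vstore.write("Sym1", [1.0, 2.0, 3.0])
vstore.snapshot("Snap B")

print(vstore.read("Sym1"))                     # latest data
print(vstore.read("Sym1", snapshot="Snap A"))  # data as pinned by Snap A
```

Because a snapshot only records version numbers, taking one is cheap, and old versions stay readable for audit.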

Page 35: Building a Time Series Database

© Man 2014 35

Architecture – final system

Diagram labels: Reuters, RMDS Message Bus, Bloomberg, Banks, Kafka Queue (x2), -> Arctic -> (x2), MongoDB Cluster

MongoDB Cluster
• 16 (micro-)shard cluster
• Master + 1 replica
• Linux
• 8 cores
• 256 GB RAM
• 96TB Disk
• Infiniband network
• LZ4 compressed data

Page 36: Building a Time Series Database

36

Performance

Page 37: Building a Time Series Database

Flat files on NFS – Random market

37

Results – Performance Once a Day Data

Page 38: Building a Time Series Database

HDF5 files – Random instrument

38

Results – Performance One Minute Data

Page 39: Building a Time Series Database

Random E-Mini S&P contract from 2013

© Man 2013 39

Results – TickStore

Page 40: Building a Time Series Database

40

Results – TickStore II

Infiniband saturated

25x greater tick throughput

With just 2 machines!

Page 41: Building a Time Series Database

Random E-Mini S&P contract from 2013

41

Results – System Load

Chart legend: OtherTick, Mongo (x2); N Tasks = 32

Page 42: Building a Time Series Database

42

Performance II

Page 43: Building a Time Series Database

43

TickStore message input

Page 44: Building a Time Series Database

44

TimeSeries Query Throughput

Page 45: Building a Time Series Database

Low latency:
- 1xDay data: 4ms for 10,000 rows (vs. 2,210ms from SQL)
- OneMinute / Tick data: 1s for 3.5M rows Python (vs. 15s – 40s+ from OtherTick)
- 1s for 15M rows Java

Parallel Access:
- Cluster with 256+ concurrent data access
- Consistent throughput – little load on the Mongo server

Efficient:
- 10-15x reduction in network load
- Negligible decompression cost (lz4: 1.8Gb/s)

45

Conclusion

Page 46: Building a Time Series Database

46

The Future

- Python 3 support
- Mac Support
- VersionStore write performance

- OpenSource other native clients
  - JavaScript
  - Java
  - C#

- Contributions Welcome!

Page 47: Building a Time Series Database

47

Questions

@ManAHLTech

@jimmybb

James Blackburn

We’re [email protected]?