light up your dark data

Post on 16-Apr-2017

QuantCon

“Light Up Your Dark Data”

April 2016

2

What is dark data?

[Diagram labels: SQL, CSV, REST, JSON, repeated across many sources]

3

Example Datasets

Trade History

Signal History

Clearing Data

Log Files

Ref Data

Corp Actions

Market Data

Models

Firm Generated

Vendor Generated

4

Compounding Challenges

Accumulates Quickly

Disparate Storage

Different Vendors

Format Changes

Ad-hoc Usage

Urgent!

5

Workflow

Find Data

Ad-Hoc ETL

Store / Copy

Analysis

Report

6

Sample Environment

Oracle MySQL MSSQL KDB ZIP CSV

SQL

Python

DSL

R Matlab

C++ Java

Storage

ETL

Analysis

REST

7

Independent First Class Citizens

Expression

Compute

Data

8

Datashape

Structured data description language

http://datashape.pydata.org

9

Datashape Example

daily_bars: var * {
    date: string,
    symbol: string,
    open: float64,
    high: float64,
    low: float64,
    close: float64,
    volume: int64,
}

Language, compute, and storage independent
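The idea of one record description reused everywhere can be sketched in plain Python. This is not the datashape library: the `DAILY_BARS` mapping and the `coerce` helper are hypothetical names chosen for illustration, and the Python types only approximate `float64` and `int64`. The sample values come from the vendor file shown later in the deck.

```python
# Hypothetical stand-in for the daily_bars record; this is plain Python,
# not the datashape library.
DAILY_BARS = {
    "date": str,
    "symbol": str,
    "open": float,
    "high": float,
    "low": float,
    "close": float,
    "volume": int,
}


def coerce(raw, schema=DAILY_BARS):
    """Convert a mapping of strings into typed values, field by field."""
    missing = [name for name in schema if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {name: convert(raw[name]) for name, convert in schema.items()}


# Values taken from the vendor sample shown later in the deck.
raw_row = {
    "date": "20151111", "symbol": "aaap", "open": "18.5", "high": "25.9",
    "low": "18", "close": "24.5", "volume": "1584600",
}
bar = coerce(raw_row)
```

Because the schema is ordinary data, the same mapping could drive a CSV reader, a SQL table definition, or a validation step without being rewritten for each one.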

10

Blaze

Write expressions independent of storage system

Push computations to the data

Lazy evaluation

Pandas-like API
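A minimal sketch of what "expressions independent of storage" and "push computations to the data" can mean in practice. It does not use Blaze or its API: `MeanWhere`, `compute_rows`, and `compute_sql` are hypothetical names, and the only backends are an in-memory list and SQLite from the standard library. The expression object is just a description; nothing runs until a backend is asked to compute it.

```python
import sqlite3


class MeanWhere:
    """Lazy description of: mean(column) over rows where filter_col > threshold."""

    def __init__(self, column, filter_col, threshold):
        self.column = column
        self.filter_col = filter_col
        self.threshold = threshold


def compute_rows(expr, rows):
    """Backend 1: evaluate the expression over in-memory dictionaries."""
    values = [r[expr.column] for r in rows if r[expr.filter_col] > expr.threshold]
    return sum(values) / len(values) if values else None


def compute_sql(expr, conn, table):
    """Backend 2: translate the expression to SQL so the database does the work.

    Column and table names are interpolated directly, so this sketch
    assumes they come from trusted code rather than user input.
    """
    query = (
        f'SELECT AVG("{expr.column}") FROM "{table}" '
        f'WHERE "{expr.filter_col}" > ?'
    )
    return conn.execute(query, (expr.threshold,)).fetchone()[0]


# Close and volume values from the vendor sample file in this deck.
rows = [
    {"close": 24.5, "volume": 1584600},
    {"close": 25.0, "volume": 83000},
    {"close": 25.26, "volume": 67300},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bars (close REAL, volume INTEGER)")
conn.executemany("INSERT INTO bars VALUES (:close, :volume)", rows)

# One expression, defined once, evaluated against two storage systems.
expr = MeanWhere("close", "volume", 70000)
from_memory = compute_rows(expr, rows)
from_database = compute_sql(expr, conn, "bars")
```

In the SQL backend only the single aggregated value leaves the database, which is the point of pushing the computation to where the data lives.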

11

Blaze

http://blaze.pydata.org/

12

Blaze Expressions

13

Flat File Repositories

Many directories and files

Dictated structure

Naming convention part of dataset

Requires one-off ad-hoc scripts

14

Vendor – directory structure

/daily/us/nasdaq stocks/
/daily/us/nasdaq stocks/1/
/daily/us/nasdaq stocks/2/
    osn.us.txt
    ostk.us.txt
    …
    zyne.us.txt
/daily/us/nyse etfs/
/daily/us/nyse stocks/1/
/daily/us/nyse stocks/2/

Contains ~8400 individual files

15

Vendor – file contents

Date,Open,High,Low,Close,Volume,OpenInt
20151111,18.5,25.9,18,24.5,1584600,0
20151112,24.25,27.12,22.5,25,83000,0
20151113,25.47,26.2,24.55,25.26,67300,0
20151116,25.01,26.19,24.13,25.02,16900,0
20151117,24.46,25.51,24.38,24.62,25900,0
20151118,24.62,26.31,24.06,25,111100,0
20151119,24.85,26,24.71,25.9,113100,0
…

Symbol is not contained within the individual data files

/daily/us/nasdaq stocks/1/aaap.us.txt

16

Lux

source: "lux://global-equities/data/daily/us/nasdaq stocks"
extractor: "{}/{Symbol}.{Region}.txt"

Date,Open,High,Low,Close,Volume,OpenInt,Symbol,Region
20151111,18.5,25.9,18,24.5,1584600,0,aaap,us
20151112,24.25,27.12,22.5,25,83000,0,aaap,us
20151113,25.47,26.2,24.55,25.26,67300,0,aaap,us
…
20160322,11.56,11.98,10.8894,11.09,517604,0,zyne,us
20160323,11.3,11.72,9.5,9.75,489743,0,zyne,us
20160324,9.5,10.24,9.22,9.64,188512,0,zyne,us

One dataset with ~5.5 million rows
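The effect of the extractor pattern can be approximated with the standard library. This is a sketch of the idea, not Lux itself: `pattern_to_regex` and `combine` are hypothetical helpers, the reading of `{}` as "match and discard" and `{Name}` as "capture as a column" is an assumption about the pattern syntax, and the two-file demo layout is invented, with row values copied from the slides.

```python
import csv
import re
import tempfile
from pathlib import Path


def pattern_to_regex(pattern):
    """Turn '{}/{Symbol}.{Region}.txt' into a regex with named groups."""
    out = []
    for part in re.split(r"(\{\w*\})", pattern):
        if part == "{}":
            out.append(r"[^/]+")  # anonymous placeholder: match, discard
        elif part.startswith("{") and part.endswith("}"):
            out.append(rf"(?P<{part[1:-1]}>[^/.]+)")  # named placeholder: capture
        else:
            out.append(re.escape(part))  # literal text such as '/' or '.txt'
    return re.compile("".join(out) + r"\Z")


def combine(root, pattern):
    """Yield every row under root, with fields extracted from the path added."""
    regex = pattern_to_regex(pattern)
    root = Path(root)
    for path in sorted(root.rglob("*.txt")):
        match = regex.match(path.relative_to(root).as_posix())
        if match is None:
            continue  # file does not follow the naming convention
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                row.update(match.groupdict())
                yield row


# Demo layout: two tiny files in a temporary directory.
header = "Date,Open,High,Low,Close,Volume,OpenInt\n"
with tempfile.TemporaryDirectory() as tmp:
    for rel, line in [
        ("1/aaap.us.txt", "20151111,18.5,25.9,18,24.5,1584600,0\n"),
        ("2/zyne.us.txt", "20160322,11.56,11.98,10.8894,11.09,517604,0\n"),
    ]:
        target = Path(tmp, rel)
        target.parent.mkdir(parents=True)
        target.write_text(header + line)
    rows = list(combine(tmp, "{}/{Symbol}.{Region}.txt"))
```

Each row in `rows` now carries `Symbol` and `Region` keys taken from its file name, so the separate files read as one dataset. `combine` is a generator, so a large repository could be streamed rather than loaded at once.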

17

Lux Benefits

Combines individual files

No separate ETL or storage

Names become part of data

Optimized compute

18

Anaconda Mosaic

Interactive exploration

Intuitive interface

Advanced visualizations

Catalog of datasets and expressions

Provenance and Governance

19

Live Walkthrough

20

Project References

• Anaconda Mosaic - http://know.continuum.io/Anaconda-Mosaic

• Blaze Ecosystem - http://blaze.pydata.org

• Bokeh - http://bokeh.pydata.org
