Upload: nosqlmatters

Post on 15-Jul-2015

Page 1: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

GoDataDriven

PROUDLY PART OF THE XEBIA GROUP

Real time data driven applications and SQL vs NoSQL databases

Giovanni Lanzani, Data Whisperer

Page 2: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Who am I?

2008-2012: PhD Theoretical Physics

2012-2013: KPMG

2013-Now: GoDataDriven

Page 3: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Feedback

@gglanzani

Page 4: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Real-time, data driven app?

• Not just store and retrieve;

• Store, {transform, enrich, analyse} and retrieve;

• Real-time: retrieve is not a batch process;

• App: something your mother could use:

SELECT attendees FROM NoSQLMatters WHERE password = '1234';

Page 5: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Get insight about event impact

Page 9: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Challenges

1. Big Data;

2. Privacy;

3. Some real-time analysis;

4. Real-time retrieval.

Page 10: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Is it Big Data?

“Everybody talks about it. Nobody knows how to do it. Everyone thinks everyone else is doing it, so everyone claims they’re doing it…”

Dan Ariely

Page 11: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Is it Big Data?

• Raw logs are on the order of 40TB;

• We use Hadoop for storing, enriching and pre-processing.

Page 12: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

2. Privacy

Page 13: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

3. (Some) real-time analysis

Page 14: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

4. Real-Time Retrieval

• Harder than it looks;

• Large data;

• Retrieval is by date, center location and radius.
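One possible way to turn a center location and a radius into a set of postcodes is a great-circle distance filter over postcode centroids. This is only a sketch: the `CENTROIDS` table, its coordinates and the `postcodes_within` helper are hypothetical and do not come from the talk.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def postcodes_within(centroids, lat, lon, radius_km):
    """Return the postcodes whose centroid lies within radius_km of (lat, lon)."""
    return sorted(
        pc for pc, (pc_lat, pc_lon) in centroids.items()
        if haversine_km(lat, lon, pc_lat, pc_lon) <= radius_km
    )


# Hypothetical centroids: the postcodes are the sample values from the slides,
# the coordinates are assigned arbitrarily for illustration.
CENTROIDS = {
    "1234AB": (52.3702, 4.8952),  # roughly Amsterdam
    "5555ZB": (51.4416, 5.4697),  # roughly Eindhoven
}

nearby = postcodes_within(CENTROIDS, 52.37, 4.90, radius_km=10)
```

The resulting list of postcodes is what a query would then filter on, together with the requested dates.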

Page 15: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Architecture

Front-end: AngularJS

Back-end: python app

Between them: REST, JSON

Page 16: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

JS-1

Page 17: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

JS-2

Page 18: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Data Example

date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB    2     1      2
2013-01-08  11    2345         5555ZB    55    2      2

Page 20: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

helper.py example

def get_statistics(data, sbi):
    # select * from data where sbi = sbi
    sbi_df = data[data.sbi == sbi]
    hits = sbi_df.hits.sum()  # select sum(hits) from …
    delta_hits = sbi_df.delta.sum()  # select sum(delta) from …
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}
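A possible usage sketch for `get_statistics`, fed with the four sample rows from the "Data Example" slide. The `int()` casts are an assumption added here so that the result can be serialised to JSON for the front-end; they are not on the slide.

```python
import json

import pandas as pd

# The four sample rows from the "Data Example" slide.
data = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-08", "2013-01-01", "2013-01-08"],
    "hour": [12, 12, 11, 11],
    "id_activity": [1234, 1234, 2345, 2345],
    "postcode": ["1234AB", "1234AB", "5555ZB", "5555ZB"],
    "hits": [35, 45, 2, 55],
    "delta": [22, 35, 1, 2],
    "sbi": [1, 1, 2, 2],
})


def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]
    # Assumption: cast numpy integers to plain ints so json.dumps accepts them.
    hits = int(sbi_df.hits.sum())
    delta_hits = int(sbi_df.delta.sum())
    # Relative change as a fraction; 0 when there is nothing to compare with.
    percentage = (hits - delta_hits) / delta_hits if delta_hits else 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}


stats = get_statistics(data, sbi=1)
payload = json.dumps(stats)  # what the REST back-end could send to the front-end
```

For `sbi=1` the two matching rows give 80 hits against a delta of 57, so `percentage` is 23/57.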

Page 21: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

helper.py example

def get_timeline(data, sbi):
    # select sum(hits), sum(delta) from data group by date, hour, sbi
    # (the sbi argument is unused here, as on the slide)
    df_sbi = data.groupby(["date", "hour", "sbi"])[["hits", "delta"]].sum()
    return df_sbi
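For comparison, the SQL in the comment can be run as is. The sketch below uses an in-memory SQLite database purely for illustration (the talk does not use SQLite), and the fifth row is a hypothetical addition so that at least one group actually sums two rows.

```python
import sqlite3

ROWS = [
    ("2013-01-01", 12, 1234, "1234AB", 35, 22, 1),
    ("2013-01-08", 12, 1234, "1234AB", 45, 35, 1),
    ("2013-01-01", 11, 2345, "5555ZB", 2, 1, 2),
    ("2013-01-08", 11, 2345, "5555ZB", 55, 2, 2),
    # Hypothetical extra row, not on the slide: same date, hour and sbi
    # as the first row, so the GROUP BY has something to add up.
    ("2013-01-01", 12, 1234, "1234AC", 5, 3, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (date TEXT, hour INTEGER, id_activity INTEGER, "
    "postcode TEXT, hits INTEGER, delta INTEGER, sbi INTEGER)"
)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)

# Same aggregation as the pandas groupby: one row per (date, hour, sbi).
timeline = conn.execute(
    "SELECT date, hour, sbi, SUM(hits), SUM(delta) FROM data "
    "GROUP BY date, hour, sbi ORDER BY date, hour, sbi"
).fetchall()
```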

Page 22: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Who has my data?

• The first iteration was a (pre-)POC with less data (3GB vs 500GB);

• Time constraints;

• Oops: everything is a pandas df!

Page 23: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Advantage of “everything is a df”

Pro:

• Fast!!

• Use what you know

• NO DBA’s!

• We all love CSV’s!

Contra:

• Doesn’t scale;

• Huge startup time;

• NO DBA’s!

• We all hate CSV’s!

Page 24: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

If you want to go down this path

• Set the dataframe index wisely;

• Align the data to the index:

source_data.sort_index(inplace=True)

• Beware of modifications of the original dataframe!
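A minimal sketch of these three tips, assuming the dataframe is indexed on the columns used for retrieval. The choice of `date` and `postcode` as index and the `hits_doubled` column are illustrative assumptions, not taken from the talk.

```python
import pandas as pd

source_data = pd.DataFrame({
    "date": ["2013-01-08", "2013-01-01", "2013-01-08", "2013-01-01"],
    "postcode": ["5555ZB", "1234AB", "1234AB", "5555ZB"],
    "hits": [55, 35, 45, 2],
})

# Tip 1: index on the columns used for lookups (assumed: date and postcode).
source_data = source_data.set_index(["date", "postcode"])

# Tip 2: align the data to the index; a sorted index lets pandas locate
# labels without scanning every row.
source_data.sort_index(inplace=True)

# Select all rows for one date via the first index level.
day = source_data.loc["2013-01-01"]

# Tip 3: work on an explicit copy so later changes never touch source_data.
enriched = day.copy()
enriched["hits_doubled"] = enriched["hits"] * 2
```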

Page 25: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

If you want to go down this path

“The reason pandas is faster is because I came up with a better algorithm”

Page 26: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

If you don’t

Front-end: AngularJS

Back-end: python app

Database

Between them: REST, JSON?

Page 27: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

A word about (traditional) databases…

Page 28: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Db: programming language dict

Page 29: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Postgres for data driven apps?

Page 31: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Issues?!

• With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:

SELECT * FROM datapoints WHERE date IN date_array AND postcode IN postcode_array;

• There is an index on date and postcode, but single queries still run for more than 20 minutes.
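The IN query on this slide could be issued from the Python back-end roughly as follows. This is a sketch against an in-memory SQLite database, chosen only so that it runs anywhere; the talk used Postgres, and `select_in` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (date TEXT, postcode TEXT, hits INTEGER)")
conn.execute("CREATE INDEX idx_date_postcode ON datapoints (date, postcode)")
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [("2013-01-01", "1234AB", 35), ("2013-01-08", "1234AB", 45),
     ("2013-01-01", "5555ZB", 2), ("2013-01-08", "5555ZB", 55)],
)


def select_in(conn, dates, postcodes):
    """Run the IN query with one bound parameter per date and per postcode."""
    date_marks = ", ".join("?" * len(dates))
    postcode_marks = ", ".join("?" * len(postcodes))
    sql = (
        "SELECT date, postcode, hits FROM datapoints "
        f"WHERE date IN ({date_marks}) AND postcode IN ({postcode_marks}) "
        "ORDER BY date, postcode"
    )
    return conn.execute(sql, [*dates, *postcodes]).fetchall()


rows = select_in(conn, ["2013-01-01"], ["1234AB", "5555ZB"])
```

With 10k postcodes the second placeholder list grows to 10k entries, which is the query shape the slide reports as slow.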

Page 32: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Postgres + PostGIS (2.x)

PostGIS is a spatial database extender for PostgreSQL. It supports geographic objects, allowing location queries:

-- every point within 1.5km from (lat, lon) on imaginary dates
-- (schematic: the real ST_DWithin takes two geometry/geography arguments and a distance)
SELECT * FROM datapoints
WHERE ST_DWithin(lon, lat, 1500)
AND dates IN ('2013-02-30', '2013-02-31');
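The location query on this slide is schematic. A sketch of how the back-end might build a parameterised version is given below, under the assumption that `datapoints` has a `geography` column, here called `geog` (a hypothetical name), and a `date` column. `radius_query` is likewise hypothetical; it only builds the SQL text and parameters and does not connect to a database.

```python
def radius_query(lon, lat, radius_m, dates):
    """Build a PostGIS radius query with DB-API style %s placeholders.

    Assumes a hypothetical geography column named geog on datapoints.
    """
    date_marks = ", ".join(["%s"] * len(dates))
    sql = (
        "SELECT * FROM datapoints "
        # ST_MakePoint takes (x, y), i.e. (longitude, latitude).
        "WHERE ST_DWithin("
        "geog, ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, %s) "
        f"AND date IN ({date_marks})"
    )
    params = [lon, lat, radius_m, *dates]
    return sql, params


sql, params = radius_query(4.8952, 52.3702, 1500, ["2013-01-01", "2013-01-08"])
```

With geography arguments, `ST_DWithin` interprets the distance in metres, so 1500 corresponds to the 1.5km on the slide.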

Page 33: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Other db’s?

Page 34: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

How we solved it

1. Align data on disk by date;

2. Use the temporary table trick:

CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;

SELECT * FROM tmp JOIN datapoints d ON d.postcode = tmp.postcodes WHERE d.dt IN dates_array;

3. Lose precision: 1234AB→1234
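Steps 2 and 3 can be sketched together as follows. SQLite stands in for the real database only so that the example is runnable; `truncate_postcode` and `select_via_tmp` are hypothetical helpers, and the sketch assumes `datapoints` already stores the truncated four-digit postcodes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INTEGER)")
# Assumption: the stored data has already been reduced to 4-digit postcodes.
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [("2013-01-01", "1234", 35), ("2013-01-08", "1234", 45),
     ("2013-01-01", "5555", 2), ("2013-01-08", "5555", 55)],
)


def truncate_postcode(postcode):
    """Lose precision: keep only the four digits, e.g. 1234AB -> 1234."""
    return postcode[:4]


def select_via_tmp(conn, dates, postcodes):
    """Join against a temporary postcode table instead of a long IN list."""
    short = sorted({truncate_postcode(pc) for pc in postcodes})
    conn.execute("DROP TABLE IF EXISTS tmp")
    conn.execute(
        "CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)"
    )
    conn.executemany(
        "INSERT INTO tmp (postcodes) VALUES (?)", [(pc,) for pc in short]
    )
    marks = ", ".join("?" * len(dates))
    sql = (
        "SELECT d.dt, d.postcode, d.hits FROM tmp "
        "JOIN datapoints d ON d.postcode = tmp.postcodes "
        f"WHERE d.dt IN ({marks}) ORDER BY d.dt, d.postcode"
    )
    return conn.execute(sql, list(dates)).fetchall()


rows = select_via_tmp(conn, ["2013-01-08"], ["1234AB", "1234AC", "5555ZB"])
```

Truncating also shrinks the temporary table: the three full postcodes above collapse to two four-digit ones.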

Page 35: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

Take home messages

1. Geospatial problems are “hard” and can kill your queries;

2. Not everybody has infinite resources: be smart and KISS!

3. SQL or NoSQL? (Size, schema)

Page 36: Giovanni Lanzani – SQL & NoSQL databases for data driven applications - NoSQL matters Barcelona 2014

GoDataDriven

We’re hiring / Questions? / Thank you!

@gglanzani

Giovanni Lanzani, Data Whisperer