Handle TBs with $1500/M (or less)
By @hunglin
Because We Are All Curious
And We Have (some useful) Tools Now
What If Data Can Be Easy...
Story at VideoBlocks
Context: Storyteller
● Data Handyman at VideoBlocks
● Organizer of DC Scala meetup
● I LOVE DATA
● Also love Scala and Spark
Context: VideoBlocks
● A media company
○ Creative Content Everyone Can Afford
● 3 websites, 100K paid customers
● Hosted on AWS
● 16 engineers (80 employees total)
● 9M requests per day, peaking at 300 reqs/sec
● Deploys about 5 times a week
We Want to Know Everything About Our (Potential) Customers
Our Data Issues
● Data everywhere (data silos)
● Data integration: mismatched formats, like "" or 0 vs. null
● Data latency: sub-second, sub-minute, sub-hour, and sub-day requirements are very different.
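The data-integration bullet can be made concrete with a small normalization expression. The table and column names here are hypothetical; `nullif` and `trim` are standard SQL that Redshift supports:

```sql
-- Collapse the different "missing" encodings ('' in one source,
-- '0' in another, a real null in a third) into a single null:
select nullif(nullif(trim(country), ''), '0') as country
from staging.some_source;
```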
Our Solutions
● Use S3 as the data lake: load MySQL, Mongo, click stream, AdWords, Facebook ads, ... onto S3. It's the source of truth.
● Use Redshift as the SQL interface to the S3 data.
● Use SQL to process data.
● Run a nightly job to create materialized views (aggregated data) for query speed.
● S3/Redshift is the engine of all data tools: Spark, Python, R, dashboards, the alert system.
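The nightly "materialized view" is just an aggregate table rebuilt from the raw events. A minimal sketch under assumed names (`event.page_view`, an `agg` schema, `site`/`vid` columns are all hypothetical; Redshift of this era had no native materialized views, so a plain table is recreated each night):

```sql
-- Hypothetical nightly job: rebuild the aggregated table from raw events.
begin;
drop table if exists agg.daily_page_views;
create table agg.daily_page_views
  distkey (site)
  sortkey (day)
as
select
  site,
  trunc(date)         as day,
  count(*)            as page_views,
  count(distinct vid) as unique_visitors
from event.page_view
group by site, trunc(date);
commit;
```

Wrapping the drop/create in one transaction keeps readers from ever seeing a missing table.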
Click streams to Redshift: how?

[Diagram: an EC2 instance running webhead, fluentd, and loggly containers; the webhead's events are forwarded through fluentd to kinesis-firehose]
Event-Log-Loader

[Diagram: the Loader pulls batches of event logs and loads them into Redshift]
Wait! The data formats don't match
create_temp_table.sql

```sql
create table {{tempEventTable}}_dup (
  "name"      varchar(40),
  "uuid"      varchar(40),
  "requestid" varchar(40),
  "country"   varchar(40),
  "subdomain" varchar(40),
  "vid"       varchar(70),
  "mid"       int,
  "payload"   varchar(65535),
  "date"      timestamp,
  primary key ("uuid"))
distkey ("uuid")
sortkey ("uuid");
```
load.sql

```sql
copy {{tempEventTable}}_dup
from '{{dataUrl}}'
credentials '{{credentials}}'
json 'auto' gzip
timeformat 'epochmillisecs'
maxerror 10000;

select distinct * into table {{tempEventTable}}
from {{tempEventTable}}_dup;
```
process.sql

```sql
insert into event.{{ siteName }}_page_view
  (uuid, request_id, vid, mid, date, uri, referrer_uri, campaign, ...)
select
  uuid, requestid, vid, mid, date,
  etl_text(json_extract_path_text(payload, 'uri'), 1000),
  etl_text(json_extract_path_text(payload, 'referrerUri'), 200),
  etl_text(json_extract_path_text(payload, 'utm', 'campaign'), 80),
  ...
from {{ tempEventTable }}
where name = '{{ eventName }}'
  and length(vid) = 64
  and uuid not in (
    select uuid from event.{{ siteName }}_page_view
    where date >= (select min(date) from {{ tempEventTable }}));
```
SQL?! Is it 1990? Aren't we in NoSql era already?!
NoSql means Not yet SQL
Benefits of this approach
● Scalable by default
● Understandable/editable by the product, analytics, and management teams
● Scalable cost: $115/M per node. VideoBlocks started with a 2-node cluster ($230/M) and grew to a 12-node cluster ($1380/M).
● On-demand processing power: teams can bring up a cluster from a snapshot to run a data test, and kill it after getting the results.
Things to improve
● SQL code is ugly, hard to unit test and debug
● Performance issues
○ mismatched sortkey or distkey
○ inefficient queries
● Read and write on the same cluster (resource management on the Redshift cluster)
○ Write at night, read in the morning (if one-day data latency is OK)
○ Use multiple Redshift clusters (more expensive)
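On the mismatched sortkey/distkey point: Redshift only avoids cross-node redistribution when joins hit the distkey, and only skips disk blocks (via zone maps) when filters hit the sortkey. A hedged sketch, with hypothetical table and column names:

```sql
-- If most queries join on vid and filter on a date range,
-- distribute on vid (co-locates matching join rows on one node)
-- and sort on date (lets zone maps skip blocks outside the range):
create table event.page_view_tuned
  distkey (vid)
  sortkey (date)
as
select * from event.page_view;
```

Keys chosen for one workload can be exactly wrong for another, which is why this shows up as a recurring tuning chore.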
In Conclusion
● Redshift is cost efficient.
● SQL is "still" the most common data language.
● SQL is also the most supported data language.
● Scalable by default (with caveats, like all other systems).
● On-demand data + processing power using snapshots: multiple stages of deployment.
● Good enough UI to get a high-level idea of the cluster.
● Can only use SQL (compared to a Spark cluster).
● SQL is not the ideal programming language.
● Monitoring, performance tuning, and debugging need some trial and learning (better than other systems IMO).
Questions?