Handle TBs with $1500/M (or less)
By @hunglin
Because We Are All Curious
And We Have (some useful) Tools Now
What If Data Can Be Easy...
Story at VideoBlocks
Context: Storyteller
● Data Handyman at VideoBlocks
● Organizer of DC Scala meetup
● I LOVE DATA
● Also love Scala and Spark
Context: VideoBlocks
● A media company
○ Creative Content Everyone Can Afford
● 3 websites, 100K paid customers
● Hosted on AWS
● 16 engineers (80 employees total)
● 9M requests per day, peaking at 300 reqs/sec
● Deploys about 5 times a week
We Want to Know Everything About Our (Potential) Customers
Our Data Issues
● Data everywhere (data silos)
● Data integration: mismatched formats, like "" or 0 vs. null
● Data latency: sub-second, sub-minute, sub-hour, and sub-day requirements are very different.
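The data-integration bullet can be made concrete with a small normalization expression. The table and column names here are hypothetical; `nullif` and `trim` are standard SQL that Redshift supports:

```sql
-- Collapse the different "missing" encodings ('' in one source,
-- '0' in another, a real null in a third) into a single null:
select nullif(nullif(trim(country), ''), '0') as country
from staging.some_source;
```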
Our Solutions
● Use S3 as the data lake: load MySQL, Mongo, click stream, AdWords, Facebook ads, ... onto S3. It's the source of truth.
● Use Redshift as the SQL interface to the S3 data.
● Use SQL to process data.
● Run a nightly job to create materialized views (aggregated data) for query speed.
● S3/Redshift is the engine of all data tools: Spark, Python, R, dashboards, the alert system.
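The nightly "materialized view" is just an aggregate table rebuilt from the raw events. A minimal sketch under assumed names (`event.page_view`, an `agg` schema, `site`/`vid` columns are all hypothetical; Redshift of this era had no native materialized views, so a plain table is recreated each night):

```sql
-- Hypothetical nightly job: rebuild the aggregated table from raw events.
begin;
drop table if exists agg.daily_page_views;
create table agg.daily_page_views
  distkey (site)
  sortkey (day)
as
select
  site,
  trunc(date)         as day,
  count(*)            as page_views,
  count(distinct vid) as unique_visitors
from event.page_view
group by site, trunc(date);
commit;
```

Wrapping the drop/create in one transaction keeps readers from ever seeing a missing table.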
Click streams to Redshift: how?

[Diagram: an EC2 instance running webhead, fluentd, and loggly containers; the webhead's events are forwarded through fluentd to kinesis-firehose]
Event-Log-Loader

[Diagram: the Loader pulls batches of event logs and loads them into Redshift]
Wait! The data formats don't match
create_temp_table.sql

```sql
create table {{tempEventTable}}_dup (
  "name"      varchar(40),
  "uuid"      varchar(40),
  "requestid" varchar(40),
  "country"   varchar(40),
  "subdomain" varchar(40),
  "vid"       varchar(70),
  "mid"       int,
  "payload"   varchar(65535),
  "date"      timestamp,
  primary key ("uuid"))
distkey ("uuid")
sortkey ("uuid");
```
load.sql

```sql
copy {{tempEventTable}}_dup
from '{{dataUrl}}'
credentials '{{credentials}}'
json 'auto' gzip
timeformat 'epochmillisecs'
maxerror 10000;

select distinct * into table {{tempEventTable}}
from {{tempEventTable}}_dup;
```
process.sql

```sql
insert into event.{{ siteName }}_page_view
  (uuid, request_id, vid, mid, date, uri, referrer_uri, campaign, ...)
select
  uuid, requestid, vid, mid, date,
  etl_text(json_extract_path_text(payload, 'uri'), 1000),
  etl_text(json_extract_path_text(payload, 'referrerUri'), 200),
  etl_text(json_extract_path_text(payload, 'utm', 'campaign'), 80),
  ...
from {{ tempEventTable }}
where name = '{{ eventName }}'
  and length(vid) = 64
  and uuid not in (
    select uuid from event.{{ siteName }}_page_view
    where date >= (select min(date) from {{ tempEventTable }}));
```
SQL?! Is it 1990? Aren't we in NoSql era already?!
NoSql means Not yet SQL
Benefits of this approach
● Scalable by default
● Understandable/editable by the product, analytics, and management teams
● Scalable cost: $115/M per node. VideoBlocks started with a 2-node cluster ($230/M) and grew to a 12-node cluster ($1380/M).
● On-demand processing power: teams can bring up a cluster from a snapshot to run a data test, and kill it after getting the results.
Things to improve
● SQL code is ugly, hard to unit test and debug
● Performance issues
○ mismatched sortkey or distkey
○ inefficient queries
● Read and write on the same cluster (resource management on the Redshift cluster)
○ Write at night, read in the morning (if one-day data latency is OK)
○ Use multiple Redshift clusters (more expensive)
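On the mismatched sortkey/distkey point: Redshift only avoids cross-node redistribution when joins hit the distkey, and only skips disk blocks (via zone maps) when filters hit the sortkey. A hedged sketch, with hypothetical table and column names:

```sql
-- If most queries join on vid and filter on a date range,
-- distribute on vid (co-locates matching join rows on one node)
-- and sort on date (lets zone maps skip blocks outside the range):
create table event.page_view_tuned
  distkey (vid)
  sortkey (date)
as
select * from event.page_view;
```

Keys chosen for one workload can be exactly wrong for another, which is why this shows up as a recurring tuning chore.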
In Conclusion
● Redshift is cost efficient.
● SQL is "still" the most common data language.
● SQL is also the most supported data language.
● Scalable by default (with caveats, like all other systems).
● On-demand data + processing power using snapshots: multiple stages of deployment.
● Good enough UI to get a high-level idea of the cluster.
● Can only use SQL (compared to a Spark cluster).
● SQL is not the ideal programming language.
● Monitoring, performance tuning, and debugging need some trial and learning (better than other systems IMO).
Questions?