AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
![Page 1: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/1.jpg)
AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel
Jan Borch | AWS Solutions Architect
Data Analytics on Big Data
![Page 2: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/2.jpg)
GENERATE STORE ANALYZE SHARE
![Page 3: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/3.jpg)
THE COST OF DATA GENERATION IS FALLING
![Page 4: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/4.jpg)
![Page 5: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/5.jpg)
![Page 6: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/6.jpg)
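Progress is not evenly distributed

| | 1980 | Today | Change |
| --- | --- | --- | --- |
| Price per TB | $14,000,000/TB | $30/TB | 450,000 ÷ |
| Capacity | 100 MB | 3 TB | 30,000 X |
| Throughput | 4 MB/s | 200 MB/s | 50 X |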
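The ratios on this slide can be rechecked with a few lines of arithmetic. The sketch below uses only the slide's figures. Reading the three pairs as price, drive capacity and throughput is an interpretation of the units, and decimal units (1 TB = 1,000,000 MB) are an assumption. It also derives the time needed to read a full drive end to end, which is one way to see why storage outpaced throughput.

```python
# Figures copied from the slide; the labels are an interpretation of the units.
MB_PER_TB = 1_000_000  # decimal units, an assumption

price_1980, price_today = 14_000_000, 30            # dollars per TB
capacity_1980_mb, capacity_today_mb = 100, 3 * MB_PER_TB
throughput_1980, throughput_today = 4, 200          # MB/s

price_drop = price_1980 / price_today               # slide rounds this to 450,000
capacity_gain = capacity_today_mb / capacity_1980_mb
throughput_gain = throughput_today / throughput_1980

# Seconds needed to read one full drive sequentially.
full_read_1980 = capacity_1980_mb / throughput_1980
full_read_today = capacity_today_mb / throughput_today

print(f"price fell {price_drop:,.0f}x, capacity grew {capacity_gain:,.0f}x, "
      f"throughput grew {throughput_gain:,.0f}x")
print(f"full read: {full_read_1980:.0f} s in 1980, {full_read_today / 3600:.1f} h today")
```

Capacity grew far faster than throughput, so scanning a whole drive takes much longer today than it did in 1980.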
![Page 7: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/7.jpg)
THE MORE DATA YOU COLLECT, THE MORE VALUE YOU CAN DERIVE FROM IT
![Page 8: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/8.jpg)
![Page 9: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/9.jpg)
![Page 10: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/10.jpg)
GENERATE STORE ANALYZE SHARE
Lower cost, higher throughput
![Page 11: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/11.jpg)
GENERATE STORE ANALYZE SHARE
Lower cost, higher throughput
Highly constrained
![Page 12: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/12.jpg)
Generated data
Available for analysis
DATA VOLUME
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
![Page 13: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/13.jpg)
GENERATE STORE ANALYZE SHARE
![Page 14: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/14.jpg)
GENERATE STORE ANALYZE SHARE
ACCELERATE
![Page 15: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/15.jpg)
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS
![Page 16: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/16.jpg)
GENERATE STORE ANALYZE SHARE
Amazon EC2
Amazon CloudFront
![Page 17: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/17.jpg)
![Page 18: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/18.jpg)
• Fluentd
• Flume
• Scribe
• Chukwa
• LogStash
output {
  s3 {
    bucket => "myBucket"
    aws_credential_file => "~/cred.json"
    size_file => 120MB
  }
}
![Page 19: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/19.jpg)
![Page 20: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/20.jpg)
![Page 21: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/21.jpg)
![Page 22: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/22.jpg)
“Poor man’s Analytics”
![Page 23: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/23.jpg)
![Page 24: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/24.jpg)
Embed poor-man pixel
http://www.poor-man-analytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-7019765-1&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analytics-architecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
![Page 25: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/25.jpg)
GENERATE STORE ANALYZE SHARE
![Page 26: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/26.jpg)
GENERATE STORE ANALYZE SHARE
AWS Import / Export
AWS Direct Connect
Amazon Elastic MapReduce
![Page 27: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/27.jpg)
Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regional replication of AMIs and snapshots
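To make the multipart upload item concrete, here is a minimal sketch of how a large object could be cut into parts before uploading. It only plans byte ranges and makes no AWS calls. The function name `plan_parts` and the 8 MiB default part size are assumptions for illustration. The 5 MiB minimum part size (for every part except the last) and the 10,000-part ceiling are the S3 limits.

```python
from typing import Iterator, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(total_size: int,
               part_size: int = 8 * 1024 * 1024) -> Iterator[Tuple[int, int, int]]:
    """Yield (part_number, offset, length) for each part of the object."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    count = -(-total_size // part_size)  # ceiling division
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    for number in range(1, count + 1):   # part numbers start at 1
        offset = (number - 1) * part_size
        yield number, offset, min(part_size, total_size - offset)


parts = list(plan_parts(20 * 1024 * 1024 + 1))
for number, offset, length in parts:
    print(number, offset, length)
```

Each planned range can then be uploaded independently and in parallel, which is what makes multipart upload attractive for large files.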
![Page 28: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/28.jpg)
Aggregation with S3DistCp
![Page 29: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/29.jpg)
S3DistCp on EMR: sample job
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args \
'--src,s3://myawsbucket/cf,\
--dest,s3://myoutputbucket/aggregate,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128,\
--outputCodec,lzo,\
--deleteOnSuccess'
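The `--groupBy` option is the part of this command that performs the aggregation: files whose names yield the same captured text are concatenated into one output. The sketch below imitates only that grouping step in plain Python, using the regular expression from the command. The file names are hypothetical, and skipping names that do not match is an assumption of this sketch, not a statement about S3DistCp.

```python
import re
from collections import defaultdict

# Regular expression taken from the --groupBy argument above.
GROUP_BY = re.compile(r".*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*")


def group_keys(names):
    """Map each capture-group value to the file names that share it."""
    groups = defaultdict(list)
    for name in names:
        match = GROUP_BY.fullmatch(name)
        if match is None:
            continue  # assumption: names that do not match are left out
        groups["".join(match.groups())].append(name)
    return dict(groups)


# Hypothetical CloudFront log object names.
names = [
    "cf/XABCD12345678.2012-02-23-01.HLUS3JKx.gz",
    "cf/XABCD12345678.2012-02-23-01.I9CNAZrg.gz",
    "cf/XABCD12345678.2012-02-23-02.YRRwERSA.gz",
    "cf/README.txt",
]
groups = group_keys(names)
for key, members in groups.items():
    print(key, len(members))
```

With these names, the two logs from hour `2012-02-23-01` would end up in one aggregated file and the log from hour `02` in another.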
![Page 30: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/30.jpg)
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
![Page 31: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/31.jpg)
![Page 32: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/32.jpg)
![Page 33: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/33.jpg)
AMAZON S3 SIMPLE STORAGE SERVICE
![Page 34: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/34.jpg)
![Page 35: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/35.jpg)
![Page 36: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/36.jpg)
AMAZON DYNAMODB
HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
![Page 37: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/37.jpg)
DURABLE & AVAILABLE
CONSISTENT, DISK-ONLY WRITES (SSD)
![Page 38: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/38.jpg)
LOW LATENCY
AVERAGE READS < 5MS, WRITES < 10MS
![Page 39: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/39.jpg)
NO ADMINISTRATION
![Page 40: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/40.jpg)
Very general table structure

Ads (not many rows; frequent updates, near realtime)

| ad-id | advertiser | max-price | imps to deliver | imps delivered |
| --- | --- | --- | --- | --- |
| 1 | AAA | 100 | 50000 | 1200 |
| 2 | BBB | 150 | 30000 | 2500 |

Profiles (so many rows; updated in batches)

| user-id | attribute1 | attribute2 | attribute3 | attribute4 |
| --- | --- | --- | --- | --- |
| A | XXX | XXX | XXX | XXX |
| B | YYY | YYY | YYY | YYY |
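As a rough illustration of the two access patterns on this slide, the sketch below models both tables as in-memory dictionaries keyed the same way (`ad-id` and `user-id`). It is not DynamoDB code, and `record_impression` is a hypothetical helper. In a real table the increment would typically be a conditional atomic update, so concurrent ad servers do not overshoot the delivery target.

```python
# In-memory stand-ins for the two tables; values are copied from the slide.
ads = {
    1: {"advertiser": "AAA", "max_price": 100,
        "imps_to_deliver": 50000, "imps_delivered": 1200},
    2: {"advertiser": "BBB", "max_price": 150,
        "imps_to_deliver": 30000, "imps_delivered": 2500},
}
profiles = {
    "A": {"attribute1": "XXX", "attribute2": "XXX"},
    "B": {"attribute1": "YYY", "attribute2": "YYY"},
}


def record_impression(ads_table, ad_id):
    """Count one delivered impression; return False once the target is met."""
    ad = ads_table[ad_id]
    if ad["imps_delivered"] >= ad["imps_to_deliver"]:
        return False
    ad["imps_delivered"] += 1  # the frequent, near-realtime update
    return True


served = record_impression(ads, 1)
print(served, ads[1]["imps_delivered"])
```

The Ads table is small but written on every impression, while the Profiles table is large and only rewritten by batch jobs.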
![Page 41: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/41.jpg)
500,000 WRITES PER SECOND
DURING SUPER BOWL
![Page 42: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/42.jpg)
![Page 43: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/43.jpg)
![Page 44: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/44.jpg)
AMAZON GLACIER
Reliable long-term archiving
![Page 45: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/45.jpg)
AMAZON S3
Archive to Amazon Glacier
S3 Lifecycle policies
If object older than 5 months
![Page 46: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/46.jpg)
AMAZON S3
S3 Lifecycle policies
If object older than 5 months
If object older than 1 year
Delete object from S3
/dev/null
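The two lifecycle rules on these slides amount to a simple age check. The sketch below expresses that decision in plain Python and is not an S3 API call. Lifecycle rules are configured in days, so treating "5 months" as 150 days is an assumption, and `lifecycle_action` is a hypothetical name.

```python
from datetime import date

ARCHIVE_AFTER_DAYS = 150  # "5 months" approximated in days, an assumption
EXPIRE_AFTER_DAYS = 365   # "1 year"


def lifecycle_action(created: date, today: date) -> str:
    """Return where an object of this age ends up under the two rules."""
    age_days = (today - created).days
    if age_days >= EXPIRE_AFTER_DAYS:
        return "delete"
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "glacier"
    return "s3"


created = date(2013, 1, 1)
for today in (date(2013, 3, 1), date(2013, 6, 15), date(2014, 1, 1)):
    print(today, lifecycle_action(created, today))
```

Once such rules are attached to a bucket, S3 applies them on its own, with no cron job or cleanup script to maintain.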
![Page 47: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/47.jpg)
![Page 48: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/48.jpg)
![Page 49: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/49.jpg)
AMAZON REDSHIFT
FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
![Page 50: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/50.jpg)
![Page 51: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/51.jpg)
DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was…
AMAZON REDSHIFT
A Whole Lot Simpler
A Lot Cheaper
A Lot Faster
![Page 52: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/52.jpg)
AMAZON REDSHIFT
RUNS ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate
HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage
![Page 53: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/53.jpg)
![Page 54: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/54.jpg)
![Page 55: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/55.jpg)
30 MINUTES
DOWN TO
12 SECONDS
![Page 56: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/56.jpg)
![Page 57: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/57.jpg)
AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG
Extra Large Node (HS1.XL)
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)
Eight Extra Large Node (HS1.8XL)
Cluster 2-100 Nodes (32 TB – 1.6 PB)
![Page 58: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/58.jpg)
JDBC/ODBC
![Page 59: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/59.jpg)
![Page 60: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/60.jpg)
| | Price Per Hour for HS1.XL Single Node | Effective Hourly Price Per TB | Effective Annual Price per TB |
| --- | --- | --- | --- |
| On-Demand | $ 0.850 | $ 0.425 | $ 3,723 |
| 1 Year Reservation | $ 0.500 | $ 0.250 | $ 2,190 |
| 3 Year Reservation | $ 0.228 | $ 0.114 | $ 999 |
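The per-TB columns follow from the node price with two steps of arithmetic. The sketch below reproduces them, assuming the 2 TB of storage per HS1.XL node quoted earlier in the deck and a year of 8,760 hours. The function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours
NODE_STORAGE_TB = 2         # HS1.XL storage, as quoted earlier in the deck

hourly_node_price = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(node_price_per_hour: float):
    """Return (hourly price per TB, annual price per TB rounded to dollars)."""
    per_tb_hour = node_price_per_hour / NODE_STORAGE_TB
    return round(per_tb_hour, 3), round(per_tb_hour * HOURS_PER_YEAR)


for plan, price in hourly_node_price.items():
    per_hour, per_year = effective_prices(price)
    print(f"{plan}: ${per_hour:.3f}/TB/hour, ${per_year:,}/TB/year")
```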
![Page 61: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/61.jpg)
DATA WAREHOUSING DONE THE AWS WAY
No upfront costs, pay as you go
Really fast performance at a really low price
Open and flexible with support for popular tools
Easy to provision and scale up massively
![Page 62: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/62.jpg)
USAGE SCENARIOS
![Page 63: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/63.jpg)
Reporting Warehouse
Accelerated operational reporting
Support for short-time use cases
Data compression, index redundancy
RDBMS (OLTP, ERP) → Redshift (Reporting and BI)
![Page 64: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/64.jpg)
Data Integration Partners*
On-Premises Integration
RDBMS (OLTP, ERP) → Redshift (Reporting and BI)
![Page 65: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/65.jpg)
Live Archive for (Structured) Big Data
Direct integration with copy command
High velocity data
Data ages into Redshift
Low cost, high scale option for new apps
DynamoDB (OLTP, Web Apps) → Redshift (Reporting and BI)
![Page 66: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/66.jpg)
Cloud ETL for Big Data
Maintain online SQL access to historical logs
Transformation and enrichment with EMR
Longer history ensures better insight
S3 → Elastic MapReduce → Redshift → Reporting and BI
![Page 67: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/67.jpg)
COPY into Amazon Redshift
create table cf_logs (
  d date,
  t char(8),
  edge char(4),
  bytes int,
  cip varchar(15),
  verb char(3),
  distro varchar(MAX),
  object varchar(MAX),
  status int,
  referer varchar(MAX),
  agent varchar(MAX),
  qs varchar(MAX)
);
![Page 68: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/68.jpg)
COPY into Amazon Redshift
copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD';
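To show what the COPY options describe about the input files, here is a small local sketch that reads a log the same way: gunzip the object (GZIP), drop the first two header lines (IGNOREHEADER 2), split on tabs (DELIMITER '\t'), and map the fields onto the `cf_logs` columns. It runs no Redshift code. The sample log line and `read_cf_log` are made up for illustration.

```python
import csv
import gzip

# Column order taken from the cf_logs table definition.
COLUMNS = ["d", "t", "edge", "bytes", "cip", "verb", "distro",
           "object", "status", "referer", "agent", "qs"]


def read_cf_log(raw: bytes, ignore_header: int = 2):
    """Decode a gzipped, tab-delimited log into a list of column dicts."""
    text = gzip.decompress(raw).decode("utf-8")
    lines = text.splitlines()[ignore_header:]
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Hypothetical log content: two header lines followed by one request.
sample = "\n".join([
    "#Version: 1.0",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method ...",
    "\t".join(["2013-10-16", "09:15:02", "AMS1", "2390", "192.0.2.10", "GET",
               "d111111abcdef8.cloudfront.net", "/index.html", "200",
               "-", "Mozilla/5.0", "-"]),
])
rows = read_cf_log(gzip.compress(sample.encode("utf-8")))
print(rows[0]["d"], rows[0]["object"], rows[0]["status"])
```

Redshift does this parsing in parallel across all files under the S3 prefix, so splitting logs into many files usually loads faster than one large file.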
![Page 69: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/69.jpg)
GENERATE STORE ANALYZE SHARE
Amazon EC2
Amazon Elastic
MapReduce
![Page 70: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/70.jpg)
![Page 71: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/71.jpg)
![Page 72: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/72.jpg)
AMAZON EC2 ELASTIC COMPUTE CLOUD
![Page 73: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/73.jpg)
EC2 instance families – General purpose
m1.small
Virtual core: 1
Memory: 1.7 GiB
I/O performance: Moderate
![Page 74: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/74.jpg)
EC2 instance families – Compute optimized
cc2.8xlarge
Virtual core: 32 - 2 x Intel Xeon
Memory: 60.5 GiB
I/O performance: 10 Gbit
![Page 75: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/75.jpg)
EC2 instance families – Memory optimized
cr1.8xlarge
Virtual core: 32 - 2 x Intel Xeon
Memory: 240 GiB
I/O performance: 10 Gbit
SSD Instance store: 240 GB
![Page 76: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/76.jpg)
EC2 instance families – Storage optimized
hi1.4xlarge
Virtual core: 16
Memory: 60.5 GiB
I/O performance: 10 Gbit
SSD Instance store: 2 x 1TB
hs1.8xlarge
Virtual core: 16
Memory: 117 GiB
I/O performance: 10 Gbit
Instance store: 24 x 2TB
![Page 77: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/77.jpg)
ON A SINGLE INSTANCE
COMPUTE TIME: 4h
COST: 4h x $2.1 = $8.4
![Page 78: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/78.jpg)
ON MULTIPLE INSTANCES
COMPUTE TIME: 1h
COST: 1h x 4 x $2.1 = $8.4
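The point of these two slides is that, with hourly billing, spreading the same work over more instances shortens the wait without changing the bill. A minimal sketch of that arithmetic follows, assuming the job parallelizes perfectly and using the slide's $2.1 hourly price. `job_cost` is an illustrative name.

```python
import math

PRICE_PER_INSTANCE_HOUR = 2.1  # dollars, from the slide


def job_cost(total_instance_hours: float, instances: int,
             price: float = PRICE_PER_INSTANCE_HOUR):
    """Return (wall-clock hours, total cost) for a perfectly parallel job."""
    wall_clock_hours = total_instance_hours / instances
    return wall_clock_hours, wall_clock_hours * instances * price


single = job_cost(4, instances=1)
parallel = job_cost(4, instances=4)
print(single, parallel)
```

Real jobs rarely scale perfectly, and partial hours are rounded up under per-hour billing, so the equality is an upper bound on what parallelism gives for free.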
![Page 79: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/79.jpg)
![Page 80: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/80.jpg)
3 HOURS FOR $4828.85/hr
![Page 81: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/81.jpg)
Instead of
$20+ MILLION
in infrastructure
![Page 82: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/82.jpg)
![Page 83: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/83.jpg)
• A FRAMEWORK
• SPLITS DATA INTO PIECES
• LETS PROCESSING OCCUR
• GATHERS THE RESULTS
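The three steps named above (split, process, gather) can be sketched on a single machine with a word count, the usual introductory example. This is plain Python, not Hadoop code. The function names mirror the bullets and are otherwise arbitrary, and the pieces are processed one after another here, where a cluster would process them on separate nodes.

```python
from collections import Counter
from functools import reduce


def split(records, pieces):
    """Split the input into roughly equal pieces."""
    return [records[i::pieces] for i in range(pieces)]


def process(piece):
    """Map step: count the words within one piece, independently of the rest."""
    counts = Counter()
    for line in piece:
        counts.update(line.lower().split())
    return counts


def gather(partials):
    """Reduce step: merge the partial counts into one result."""
    return reduce(lambda left, right: left + right, partials, Counter())


def word_count(records, pieces=3):
    return gather(process(piece) for piece in split(records, pieces))


totals = word_count(["big data", "data analytics", "big big"])
print(totals.most_common())
```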
![Page 84: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/84.jpg)
AMAZON ELASTIC MAPREDUCE
HADOOP AS A SERVICE
![Page 85: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/85.jpg)
![Page 86: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/86.jpg)
![Page 87: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/87.jpg)
Corporate Data Center
Elastic Data Center
![Page 88: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/88.jpg)
Application data and logs for analysis pushed to S3
![Page 89: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/89.jpg)
Amazon Elastic MapReduce master node (M) to control analysis
![Page 90: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/90.jpg)
Hadoop cluster started by Elastic MapReduce
![Page 91: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/91.jpg)
Adding many hundreds or thousands of nodes
![Page 92: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/92.jpg)
Disposed of when job completes
![Page 93: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/93.jpg)
Results of analysis pulled back into your systems
![Page 94: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/94.jpg)
![Page 95: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/95.jpg)
Your spreadsheet does not scale …
![Page 96: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/96.jpg)
![Page 97: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/97.jpg)
PIG
![Page 98: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/98.jpg)
A real Pig script
(used at Twitter)
![Page 99: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/99.jpg)
Run on a sample dataset on your laptop
![Page 100: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/100.jpg)
$ pig -f myPigFile.q
![Page 101: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/101.jpg)
Run the same script on a 50-node Hadoop cluster
![Page 102: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/102.jpg)
$ ./elastic-mapreduce --create \
    --name "$USER's Pig JobFlow" \
    --pig-script \
    --args s3://myawsbucket/mypigquery.q \
    --instance-type m1.xlarge --instance-count 50
![Page 103: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/103.jpg)
$ ./elastic-mapreduce -j j-21IMWIA28LRK1 \
    --add-instance-group task \
    --instance-count 10 \
    --instance-type m1.xlarge
![Page 104: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/104.jpg)
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
![Page 105: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/105.jpg)
PUBLIC DATA SETS http://aws.amazon.com/publicdatasets
![Page 106: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/106.jpg)
![Page 107: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/107.jpg)
![Page 108: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/108.jpg)
![Page 109: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/109.jpg)
GENERATE STORE ANALYZE SHARE
AWS Data Pipeline
![Page 110: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/110.jpg)
AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution and retry logic
Map data dependencies
Create and manage compute resources
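Two of the ideas on this slide, mapping data dependencies and applying retry logic, can be illustrated without any AWS service. The sketch below is not the AWS Data Pipeline API. It is a toy runner that orders hypothetical activities with the standard-library `graphlib` module (Python 3.9+) and retries each one a fixed number of times. All names in it are made up for the example.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, depends_on, max_attempts=3):
    """Run each task after its dependencies, retrying on failure."""
    completed = []
    # depends_on maps an activity to the set of activities it waits for.
    for name in TopologicalSorter(depends_on).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
    return completed


attempts = {"copy_logs": 0}


def copy_logs():
    attempts["copy_logs"] += 1
    if attempts["copy_logs"] == 1:
        raise RuntimeError("transient failure")  # first attempt fails


tasks = {"copy_logs": copy_logs, "transform": lambda: None, "load": lambda: None}
depends_on = {"transform": {"copy_logs"}, "load": {"transform"}}

order = run_pipeline(tasks, depends_on)
print(order, attempts)
```

A managed service adds what this toy omits: scheduling, provisioning of the compute resources that run each activity, and persistence of state between runs.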
![Page 111: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/111.jpg)
![Page 112: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/112.jpg)
![Page 113: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/113.jpg)
![Page 114: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/114.jpg)
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
AWS Import / Export
AWS Direct Connect
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
Amazon EC2
Amazon Elastic
MapReduce
AWS Data Pipeline
![Page 115: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/115.jpg)
FROM DATA TO ACTIONABLE INFORMATION
![Page 116: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/116.jpg)
Shlomi Vaknin
![Page 117: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/117.jpg)
Amazon AWS generates big data core component for Ginger Software
Shlomi Vaknin
Oct 16, 2013
![Page 118: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/118.jpg)
English writing assistant
An open platform for personal assistants
![Page 119: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/119.jpg)
![Page 120: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/120.jpg)
Natural language speech interface for mobile apps
• Users talk naturally with any mobile application; Ginger understands and executes their command
• An end-to-end Speech-to-Action solution
• First open platform for creating personal assistants
![Page 121: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/121.jpg)
Proofreader
Speech Engine
Rephrase
PA Platform DB
Semantic Model
Writing Assistant Personal Coach
Query Understanding
NLP/NLU Algorithms
Web Corpus Language model
Domain Corpus
User Corpus
![Page 122: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/122.jpg)
• A collection of all the language we found on the internet, accessible and pre-processed
• Has to contain lots and lots of sentences
• Needs to represent “common written language”
• Accessible both for offline (research) and online (service) uses
Our platform depends on scanning and indexing all the language we can find on the internet
![Page 123: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/123.jpg)
1. Crawling [own cluster, EMR+S3]
   • Generated about 50 TB of raw data
   • Reduced to about 5 TB of text data
2. Post processing [EMR+S3]
   • Tokenize
   • Normalize
   • Split to n-grams
   • Generalize
   • Count
   • Filter
3. Indexing/Serving [EMR+S3]
   • Key/Value – has to be super fast
   • Full-text search
4. Archiving (Glacier) [S3+Glacier]
   • Keeping data available for later research while minimizing cost
![Page 124: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/124.jpg)
• Mainly an NLP task
• So we picked up
   • It’s a Lisp!
   • Integrates very well with EMR, S3, etc.
• n-Gram Counting
   • “How are you” yields: How are you, How are, are you, How, are, you
   • Many n-grams are repeated
   • Generalize contextually similar tokens
• Fits the map-reduce paradigm very well
   • Most parts can be trivially parallelized
   • One part is sequential by grams
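The n-gram counting step above can be sketched in a few lines. This is a minimal, single-machine illustration in Python, not the Lisp-based job described on the slide; the function names (`ngrams`, `map_sentence`, `reduce_counts`), the whitespace tokenizer, and the limit of three tokens per gram are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every contiguous n-gram of length 1..max_n as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def map_sentence(sentence, max_n=3):
    """Map step: emit (gram, 1) for each n-gram in one sentence.

    Tokenization here is a plain whitespace split (an assumption);
    a real pipeline would tokenize and normalize first.
    """
    tokens = sentence.split()
    return [(gram, 1) for gram in ngrams(tokens, max_n)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per gram."""
    counts = Counter()
    for gram, count in pairs:
        counts[gram] += count
    return counts


sentences = ["How are you", "How are things"]
counts = reduce_counts(
    pair for s in sentences for pair in map_sentence(s)
)
```

For "How are you" the map step emits the same six grams listed on the slide. Because each sentence is mapped independently, the map step parallelizes trivially, while the summation has to see all emissions for a given gram together.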
![Page 125: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/125.jpg)
• EMR cluster node types
   • Master, Task, Core
• Ratio between Core and Task nodes
   • We expected a very large output (100TB)
   • m2.4xlarge core output 1690GB
   • core nodes
• Estimate number of total map tasks
• Final specs:

Node Type   Instance      Count
MASTER      cc2.8xlarge   1
CORE        m2.4xlarge    200
TASK        m2.2xlarge    500
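The slide's own sizing formula did not survive in this transcript, so the following is only one plausible way to reason about the core node count. It assumes the 1690GB figure is usable storage per core node, that the whole 100TB output has to fit on the core nodes, and that data is stored with a replication factor of 3 (the usual HDFS default). The function name `core_nodes_needed` and the decimal TB-to-GB conversion are illustrative choices, not taken from the talk.

```python
import math


def core_nodes_needed(output_gb, node_storage_gb, replication=3):
    """Estimate how many core nodes are needed to hold the job output.

    Assumption: every byte of output is stored `replication` times
    across nodes that each offer `node_storage_gb` of usable space.
    """
    return math.ceil(output_gb * replication / node_storage_gb)


expected_output_gb = 100 * 1000   # 100TB, decimal units (assumption)
node_storage_gb = 1690            # per-node figure from the slide

estimate = core_nodes_needed(expected_output_gb, node_storage_gb)
unreplicated = core_nodes_needed(expected_output_gb, node_storage_gb, replication=1)
```

Under these assumptions the estimate lands somewhat below the 200 core nodes in the final specs, which would leave headroom for intermediate data; without replication the same arithmetic gives a much smaller number.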
![Page 126: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/126.jpg)
• Job took about 30 hours to complete
• We generated nearly 100TB of output data
• During map phase, the cluster achieved nearly 100% utilization
• After initial filtration, 20TB remained
![Page 127: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/127.jpg)
• Stay up to date with AMI releases
   • Don't stick to an old AMI just because it previously worked
• Use the Job-Tracker
   • Use custom progress notification
   • Increase mapred.task.timeout
• Limit the number of concurrent map tasks
   • Use the minimum number that gets you close to 100% CPU
• Beware of spot nodes
   • If you ask for too many, you might compete against your own price
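As an illustration of custom progress notification, here is a minimal sketch of a Hadoop Streaming mapper in Python. It is not the speaker's implementation (the talk used a Lisp-based tool); it assumes a Hadoop Streaming job, where a task can report status by writing lines of the form `reporter:status:<message>` to stderr so that a long-running task is not treated as hung. The function name `run_mapper` and the `report_every` parameter are hypothetical.

```python
import io
import sys


def run_mapper(lines, out=sys.stdout, err=sys.stderr, report_every=10000):
    """Emit (token, 1) pairs and periodically report progress.

    Status lines go to stderr in the Hadoop Streaming reporter format,
    which signals to the framework that the task is still alive.
    """
    for i, line in enumerate(lines, start=1):
        for token in line.split():
            out.write(f"{token}\t1\n")
        if i % report_every == 0:
            err.write(f"reporter:status:processed {i} lines\n")
            err.flush()


# Local demo using in-memory streams instead of a real cluster.
demo_out, demo_err = io.StringIO(), io.StringIO()
run_mapper(["how are you", "how are things"], demo_out, demo_err, report_every=2)
```

In an actual streaming job the mapper would be invoked as `run_mapper(sys.stdin)`. Regular status updates complement a higher `mapred.task.timeout`: the timeout covers slow stretches, and the status lines keep the task visibly alive in between.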
![Page 128: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/128.jpg)
• Stash the data for later use, to reduce cost
• Glacier offers very cheap storage
• Important things to know about Glacier:
   • Restoring the data could be VERY expensive
   • The key to reducing restore costs: restore SLOWLY
   • There is no built-in mechanism to restore slowly
      • Use a 3rd party application
      • Or do it manually
• Glacier is very useful if your use case matches its design
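Restoring slowly "manually" amounts to spreading restore requests over time. The sketch below only plans the batches and makes no AWS calls; it assumes that the restore cost grows with how much data is requested within one time window, so each window is capped. The function name `plan_restore`, the archive names, and the 100GB cap are hypothetical values chosen for the example.

```python
def plan_restore(archives, max_gb_per_window):
    """Group archives, in order, into restore windows.

    `archives` is a list of (name, size_gb) pairs. Each window's total
    stays within `max_gb_per_window`, except that an archive larger
    than the cap is placed in a window of its own.
    """
    windows, current, used = [], [], 0.0
    for name, size_gb in archives:
        if current and used + size_gb > max_gb_per_window:
            windows.append(current)
            current, used = [], 0.0
        current.append(name)
        used += size_gb
    if current:
        windows.append(current)
    return windows


archives = [("a", 40), ("b", 40), ("c", 30), ("d", 120)]
plan = plan_restore(archives, max_gb_per_window=100)
```

Each inner list in `plan` would be requested in its own window (for example, one per hour or per day), with the next batch issued only after the previous window has passed.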
![Page 129: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/129.jpg)
• EMR/S3 provides great power and elasticity, to grow and shrink as required
• Do your homework before running large jobs!
![Page 130: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data](https://reader033.vdocuments.site/reader033/viewer/2022051514/54b75a984a7959bd138b45a6/html5/thumbnails/130.jpg)
• Our platform depends on scanning and indexing all the language we can find on the internet
• To achieve this, Ginger Software makes heavy use of Amazon EMR
• With Amazon EMR, Ginger Software can scale up vast amounts of computing power and scale back down when it is not needed
• This gives Ginger Software the ability to create the world’s most accurate language enhancement technology without the need to have expensive hardware lying idle during quiet periods