© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Keith Steward, Ph.D.
Specialist (EMR) Solution Architect, AWS
July 13, 2016
Interactively Querying Large-Scale
Datasets on Amazon S3
(Presto on EMR)
Agenda
• The challenges of using data warehouses, then a data
warehouse approach
• High-level steps (overview) for querying large-scale data on
Amazon S3
• Amazon S3
• Amazon EMR
• Apache Presto: history, goals & benefits, architecture
• Presto on EMR
• Demo – Querying 29 years of U.S. Air Flights data on S3 by
using Presto on EMR
Challenges in using data warehouses
Schema-on-Write (Data, schema, Data Warehouse): significant “time to answer”; $$$$
Schema-on-Read (Data): shorter “time to answer”; $$
How to query large-scale datasets on S3?
1. Store your large-scale data in S3.
2. Configure & launch an EMR cluster with Presto.
3. Log in to the EMR cluster.
4. Expose S3 data as a Hive table.
5. Issue SQL queries against the Hive table using Presto.
6. Get query results.
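As a rough illustration of steps 5 and 6, the sketch below builds a non-interactive Presto CLI invocation. It assumes the CLI launcher on the EMR master node is named `presto-cli`, that the table lives in the `hive` catalog and `default` schema, and that the sample query (top carriers by flight count on the `airdelays` table from this deck) is only a hypothetical example. It prints the command rather than running it.

```python
import shlex


def presto_cli_command(sql, catalog="hive", schema="default"):
    """Build the argv list for a one-shot Presto CLI query.

    The launcher name, catalog, and schema are assumptions for illustration.
    """
    return ["presto-cli", "--catalog", catalog, "--schema", schema, "--execute", sql]


# Hypothetical query against the airdelays table.
sql = (
    "SELECT uniquecarrier, count(*) AS flights "
    "FROM airdelays GROUP BY uniquecarrier ORDER BY flights DESC LIMIT 10"
)
argv = presto_cli_command(sql)

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the list to `subprocess.run(argv)` on the master node would execute the query; that step is left out here so the sketch runs anywhere.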
Store anything (object storage)
Scalable
99.999999999% durability
Effectively infinite inbound bandwidth
Extremely low cost: $0.03/GB-Mo; $30.72/TB-Mo
Data layer for virtually all AWS services
Amazon S3
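The two prices on this slide are the same rate in different units. A small sketch of the arithmetic, assuming a flat rate of $0.03 per GB-month (the 2016 figure quoted above) and 1024 GB per TB:

```python
PRICE_PER_GB_MONTH = 0.03  # USD, the figure quoted on the slide
GB_PER_TB = 1024


def monthly_cost_usd(size_gb, price_per_gb=PRICE_PER_GB_MONTH):
    """Storage cost for one month at a flat per-GB rate (an assumption)."""
    return size_gb * price_per_gb


per_tb = monthly_cost_usd(GB_PER_TB)
print(f"1 TB-month: ${per_tb:.2f}")
```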
Aggregate all data in S3 as your data lake
surrounded by a collection of the right tools
Tools around Amazon S3: EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Storm, Import/Export, Snowball
Exposing large-scale datasets in S3 as Hive tables
S3 bucket with data: s3://flightdelays-kls/csv
Ask Hive to expose S3 data as table:
hive> CREATE EXTERNAL TABLE airdelays (
        yr INT,
        quarter INT,
        month INT,
        dayofmonth INT,
        dayofweek INT,
        flightdate STRING,
        uniquecarrier STRING,
        airlineid INT,
        . . .
        div5tailnum STRING)
      PARTITIONED BY (year STRING)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ',' ESCAPED BY '\\'
        LINES TERMINATED BY '\n'
      LOCATION 's3://flightdelays-kls/csv';
Hive now knows about table:
hive> describe airdelays;
OK
yr                  int
quarter             int
month               int
dayofmonth          int
dayofweek           int
flightdate          string
. . .
div5wheelsoff       string
div5tailnum         string
year                string

# Partition Information
# col_name          data_type           comment

year                string
Time taken: 0.169 seconds, Fetched: 115 row(s)
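Because the table is partitioned by year, Hive only returns rows from partitions that have been registered in the metastore. The sketch below generates `ALTER TABLE ... ADD PARTITION` statements for a range of years. The `year=YYYY/` directory layout under the table location and the year range are assumptions for illustration; the slides do not show how the demo bucket is organized.

```python
def add_partition_statements(table, location, years):
    """Return one ALTER TABLE statement per year partition.

    Assumes each partition lives under <location>/year=<YYYY>/ (hypothetical layout).
    """
    base = location.rstrip("/")
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (year='{y}') "
        f"LOCATION '{base}/year={y}/';"
        for y in years
    ]


# Hypothetical year range, chosen only to keep the output short.
stmts = add_partition_statements("airdelays", "s3://flightdelays-kls/csv", range(1987, 1990))
for stmt in stmts:
    print(stmt)
```

If the data does follow the `year=YYYY/` convention, running `MSCK REPAIR TABLE airdelays;` in Hive is an alternative that discovers the partitions automatically.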
Scalable Hadoop clusters as a service
Hadoop, Hive, Spark, Presto, HBase, etc.
Easy to use; fully managed
On demand, reserved, spot pricing
HDFS, S3, and Amazon Elastic Block Store (Amazon EBS) file systems
End-to-end security
Amazon EMR
EMRFS makes it easier to leverage Amazon S3
Better performance and error handling options
Transparent to applications – just read/write to “s3://”
Support for Amazon S3 server-side and client-side encryption
Faster listing using EMRFS metadata
HDFS is still available via local instance storage or Amazon EBS
Amazon S3 as your cluster’s persistent data store
Amazon S3
Designed for 99.999999999% durability
Separate compute and storage
Resize and shut down Amazon EMR clusters with no data loss
Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)
Demo: Let’s spin up an EMR cluster (with Presto) …
(History)
Petabyte-scale interactive query engine designed at Facebook in 2012
Originally designed for exploring existing Hive tables without triggering slow MapReduce jobs
Open-sourced in late 2013
(Benefits)
In-memory distributed query engine
Supports standard ANSI SQL
Supports rich analytical functions
Supports a wide range of data sources
Combines data from multiple sources in a single query
Response times range from seconds to minutes
(Features)
High performance: 10x faster than Hive
• E.g., Netflix runs 3500+ Presto queries/day on a 25+ PB dataset in S3 with 350 active platform users
Extensibility
• Pluggable back ends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, SystemSchema, TPCH
• JDBC and ODBC for commercial BI tools or dashboards, such as data visualization
• Client protocol: HTTP+JSON, with support for various languages (Python, Ruby, PHP, Node.js, Java (JDBC), C#, etc.)
ANSI SQL
• Complex queries, joins, aggregations, various functions (window functions)
High-level architecture
A distributed system that runs on a cluster of machines.
Components: a coordinator and multiple workers.
Queries are submitted from a client, such as the Presto CLI, to the coordinator.
The coordinator parses, analyzes, and plans the query execution, then distributes the processing to the workers.
https://prestodb.io/overview.html
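The client side of this exchange can be sketched in a few lines. In Presto's HTTP+JSON protocol, a client POSTs the SQL text to the coordinator's `/v1/statement` endpoint and then keeps requesting the `nextUri` returned in each JSON response until no `nextUri` is present. The function below shows only that polling loop. The `fetch` callable, the host name, the port, and the canned responses are hypothetical stand-ins so the example runs without a cluster, and error handling is omitted.

```python
def collect_rows(first_response, fetch):
    """Follow nextUri links and gather every page of result rows.

    first_response: JSON dict returned by the initial POST to /v1/statement.
    fetch: callable that takes a URI and returns the next JSON dict.
    """
    rows = []
    response = first_response
    while True:
        rows.extend(response.get("data", []))
        next_uri = response.get("nextUri")
        if next_uri is None:
            return rows
        response = fetch(next_uri)


# Canned pages standing in for a real coordinator (hypothetical URIs and values).
pages = {
    "http://coordinator:8889/v1/statement/q1/1": {
        "data": [["AA", 10]],
        "nextUri": "http://coordinator:8889/v1/statement/q1/2",
    },
    "http://coordinator:8889/v1/statement/q1/2": {"data": [["UA", 7]]},
}
first = {"nextUri": "http://coordinator:8889/v1/statement/q1/1"}

rows = collect_rows(first, pages.__getitem__)
print(rows)
```

Against a live cluster, `fetch` would be an HTTP GET that parses the JSON body; the Presto CLI and the JDBC driver implement this same loop.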
Presto architecture
(Why is it so fast?)
In-memory parallel queries
Pipelined task execution
Data-local computation with multi-threading
Caching of hot queries and data
Just-in-time compilation of operators to bytecode
SQL optimization
Other optimizations (e.g., predicate pushdown)
Presto: in-memory processing and pipelining
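Pipelining means each operator hands rows to the next one as soon as they are produced, instead of writing a full intermediate result to disk between stages. The toy example below uses Python generators purely as an analogy; it is not how Presto is implemented, and the sample rows are made up.

```python
def scan(rows):
    """Source operator: emit rows one at a time."""
    for row in rows:
        yield row


def filter_rows(rows, predicate):
    """Filter operator: pass through matching rows as they arrive."""
    for row in rows:
        if predicate(row):
            yield row


def count_by(rows, key):
    """Aggregation operator: consume the stream and count rows per key."""
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


# Made-up sample rows using two column names from the airdelays table.
flights = [
    {"uniquecarrier": "AA", "year": "1987"},
    {"uniquecarrier": "UA", "year": "1988"},
    {"uniquecarrier": "AA", "year": "1988"},
]

# No stage materializes its full output; rows flow through scan -> filter -> count.
result = count_by(filter_rows(scan(flights), lambda r: r["year"] == "1988"), "uniquecarrier")
print(result)
```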
Presto: accessing large-scale datasets in S3
Any table known to the Hive Metastore can be accessed/queried by Presto.
This includes data in S3 exposed via CREATE EXTERNAL TABLE statements in Hive.
Hive Metastore deployment modes
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html
Embedded Mode
• Uses Derby
• Not recommended for production
Local Mode
• Metastore service in same process as main HiveServer process
• Metastore DB runs in separate process
Remote Mode
• Metastore service runs in own JVM process
• Processes communicate with it via Thrift network API
• Metastore service communicates with Metastore DB over JDBC
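In remote mode, Presto is one of the processes that talks to the metastore service over Thrift. The sketch below renders a minimal Hive catalog file of the kind Presto reads from `etc/catalog/hive.properties`. The host name is hypothetical and 9083 is assumed as the metastore port; on EMR this file is normally generated for you, so treat this as an illustration of what the setting looks like rather than a required step.

```python
def hive_catalog_properties(metastore_host, port=9083):
    """Render a minimal Presto Hive catalog file pointing at a remote metastore.

    The host is supplied by the caller; the default port is an assumption.
    """
    lines = [
        "connector.name=hive-hadoop2",
        f"hive.metastore.uri=thrift://{metastore_host}:{port}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical metastore host name.
props = hive_catalog_properties("metastore.example.internal")
print(props, end="")
```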
Supported data sources
Currently Presto provides connectors for the following:
Hive
Cassandra
MySQL
PostgreSQL
Kafka
Redis
Common use cases
When to use Presto?
• Need fast interactive query ability with high concurrency
• Need ANSI SQL
When might you not want to use Presto?
• You focus on batch processing (ETL, enriching, aggregation, etc.) for large data sets. Hive or Spark recommended.
• Need to compute (e.g., machine learning, graph algorithms) over the retrieved data. Spark recommended.
• Star-schema organization of data. Amazon Redshift data warehouse recommended.
Airpal – a Presto GUI designed & open-sourced by Airbnb
Optional access controls for users
Search and find tables
See metadata, partitions, schemas & sample rows
Write queries in an easy-to-read editor
Submit queries through a web interface
Track query progress
Get the results back through the browser as a CSV
Create new Hive table based on the results of a query
Save queries once written
Searchable history of all queries run within the tool
Demo: Let’s now query 29 years’ worth of air-traffic data in S3 using Presto on our EMR cluster…
Summary for Presto on EMR with data in S3
Data in S3 is queryable using Presto on
Amazon EMR
Presto is easy to deploy on Amazon EMR
Presto provides fast ad-hoc queries
Supports wide range of data sources
In-memory data processing with pipelining
Feature-rich
Increasing adoption & active community
Remember to complete your
evaluations!
Reference
http://www.slideshare.net/GuorongLIANG/facebook-presto-presentation
https://prestodb.io
https://github.com/airbnb/airpal#airpal
https://github.com/treasure-data/prestogres
If you want to run this demo later in your own AWS account, go to: http://bit.ly/1Xg0111
Thank you!