Interactively Querying Large-Scale Datasets on Amazon S3

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Keith Steward, Ph.D.

Specialist (EMR) Solution Architect, AWS

July 13, 2016

Interactively Querying Large-Scale Datasets on Amazon S3 (Presto on EMR)

Agenda

• The challenges of using data warehouses, then a data warehouse approach

• High-level steps (overview) for querying large-scale data on Amazon S3

• Amazon S3

• Amazon EMR

• Apache Presto: history, goals & benefits, architecture

• Presto on EMR

• Demo – Querying 29 years of U.S. Air Flights data on S3 by using Presto on EMR

Challenges in using data warehouses

Schema-on-Write: Data, schema, Data Warehouse. Significant “time to answer”. $$$$

Schema-on-Read: Data. Shorter “time to answer”. $$

How to query large-scale datasets on S3?

1. Store your large-scale data in S3.

2. Configure & launch an EMR cluster with Presto.

3. Log in to the EMR cluster.

4. Expose S3 data as a Hive table.

5. Issue SQL queries against the Hive table using Presto.

6. Get query results.
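Step 2 above can be sketched as a small Python helper that assembles an `aws emr create-cluster` command. This is a minimal sketch, not the presenter's actual setup: the cluster name, key pair name, release label, instance type, and instance count are all placeholder assumptions, and the application name for Presto may differ between EMR releases, so check the release notes for the version you use.

```python
import shlex


def emr_presto_launch_command(name, key_name, release_label="emr-5.0.0",
                              instance_type="m3.xlarge", instance_count=3):
    """Build an `aws emr create-cluster` command that installs Hive and Presto.

    All argument values are illustrative assumptions, not values from the talk.
    """
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        # Hive provides the metastore that Presto reads table definitions from.
        "--applications", "Name=Hive", "Name=Presto",
        # The EC2 key pair is what lets you log in to the cluster (step 3).
        "--ec2-attributes", f"KeyName={key_name}",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
    ]


command = emr_presto_launch_command("presto-demo", "my-key-pair")
# Print a copy-pasteable shell command instead of running it.
print(shlex.join(command))
```

The helper only prints the command, so nothing is launched (or billed) until you choose to run it yourself.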

Amazon S3

Store anything (object storage)

Scalable

99.999999999% durability

Effectively infinite inbound bandwidth

Extremely low cost: $0.03/GB-Mo; $30.72/TB-Mo

Data layer for virtually all AWS services
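The two storage prices quoted for Amazon S3 are the same rate in different units: $0.03 per GB-month times 1024 GB per TB gives $30.72 per TB-month. A minimal sketch of that arithmetic follows; it assumes a single flat rate and ignores any pricing tiers, request charges, or data transfer charges.

```python
from decimal import Decimal

PRICE_PER_GB_MONTH = Decimal("0.03")  # rate quoted on the slide
GB_PER_TB = 1024                      # binary terabyte, as the slide's figure implies


def monthly_storage_cost(gigabytes):
    """Return the monthly storage cost in dollars at the flat quoted rate."""
    return PRICE_PER_GB_MONTH * Decimal(gigabytes)


price_per_tb = monthly_storage_cost(GB_PER_TB)
print(price_per_tb)  # 30.72
```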

Aggregate all data in S3 as your data lake, surrounded by a collection of the right tools

[Diagram: Amazon S3 at the center, surrounded by EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Storm, Import/Export Snowball]

Exposing large-scale datasets in S3 as Hive tables

S3 bucket with data: s3://flightdelays-kls/csv

Ask Hive to expose S3 data as table:

hive> CREATE EXTERNAL TABLE airdelays (
        yr INT,
        quarter INT,
        month INT,
        dayofmonth INT,
        dayofweek INT,
        flightdate STRING,
        uniquecarrier STRING,
        airlineid INT,
        . . .
        div5tailnum STRING)
      PARTITIONED BY (year STRING)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        ESCAPED BY '\\'
        LINES TERMINATED BY '\n'
      LOCATION 's3://flightdelays-kls/csv';

Hive now knows about table:

hive> describe airdelays;
OK
yr                  int
quarter             int
month               int
dayofmonth          int
dayofweek           int
flightdate          string
. . .
div5wheelsoff       string
div5tailnum         string
year                string

# Partition Information
# col_name          data_type           comment

year                string
Time taken: 0.169 seconds, Fetched: 115 row(s)
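Because the table is declared with PARTITIONED BY (year STRING), Hive also needs each partition registered in the metastore before queries against it return rows. The slides do not show that step, so the following is only a sketch under assumptions: it supposes the data is laid out in one `year=<value>` prefix per year under the table location, and the range of years is a placeholder. The helper generates one HiveQL `ALTER TABLE ... ADD PARTITION` statement per year.

```python
def add_partition_statements(table, base_location, years):
    """Yield one HiveQL ALTER TABLE statement per year partition.

    Assumes (hypothetically) that each year's files live under
    <base_location>/year=<year>/.
    """
    for year in years:
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (year='{year}') "
            f"LOCATION '{base_location}/year={year}/';"
        )


# Placeholder range of years, chosen only for illustration.
statements = list(
    add_partition_statements("airdelays", "s3://flightdelays-kls/csv",
                             range(1987, 1990))
)
for statement in statements:
    print(statement)
```

If the layout really does follow the `year=<value>` convention, Hive's `MSCK REPAIR TABLE airdelays;` is an alternative that discovers the partitions in one statement.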

Amazon EMR

Scalable Hadoop clusters as a service

Hadoop, Hive, Spark, Presto, HBase, etc.

Easy to use; fully managed

On demand, reserved, spot pricing

HDFS, S3, and Amazon Elastic Block Store (Amazon EBS) file systems

End-to-end security

EMRFS makes it easier to leverage Amazon S3

Better performance and error handling options

Transparent to applications – just read/write to “s3://”

Support for Amazon S3 server-side and client-side encryption

Faster listing using EMRFS metadata

HDFS is still available via local instance storage or Amazon EBS

Amazon S3 as your cluster’s persistent data store

Amazon S3

Designed for 99.999999999% durability

Separate compute and storage

Resize and shut down Amazon EMR clusters with no data loss

Point multiple Amazon EMR clusters at same data in Amazon S3 using the EMR File System (EMRFS)

Demo: Let’s spin up an EMR cluster (with Presto) …

Presto (History)

Petabyte-scale interactive query engine designed by Facebook in 2012

Originally designed for exploring existing Hive tables without triggering slow MapReduce jobs

Open-sourced in late 2013

Presto (Benefits)

In-memory distributed query engine

Supports standard ANSI SQL

Supports rich analytical functions

Supports a wide range of data sources

Combines data from multiple sources in a single query

Response times range from seconds to minutes

Presto (Features)

High performance: 10x faster than Hive

• E.g., Netflix runs 3500+ Presto queries/day on a 25+ PB dataset in S3 with 350 active platform users

Extensibility

• Pluggable back ends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, SystemSchema, TPCH

• JDBC and ODBC for commercial BI tools or dashboards, such as data visualization

• Client protocol: HTTP+JSON, with support for various languages (Python, Ruby, PHP, Node.js, Java (JDBC), C#, etc.)

ANSI SQL

• Complex queries, joins, aggregations, various functions (window functions)

High-level architecture

A distributed system that runs on a cluster of machines.

Components: a coordinator and multiple workers.

Queries are submitted from a client, such as the Presto CLI, to the coordinator.

The coordinator parses, analyzes, and plans the query execution, then distributes the processing to the workers.

https://prestodb.io/overview.html
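The coordinator/worker division described above can be illustrated with a toy sketch. This is not Presto's implementation: it uses a local thread pool in place of a cluster, and the function names and sample rows are invented for illustration. It only shows the shape of the flow, in which a coordinator splits the work, workers compute partial aggregates in parallel, and the coordinator merges them.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, num_workers):
    """Coordinator step: divide the input into one split per worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def worker_count_by_carrier(split):
    """Worker step: compute a partial flight count per carrier for one split."""
    return Counter(carrier for carrier, _delay in split)


def run_query(rows, num_workers=3):
    """Coordinator: plan, distribute to workers, then merge partial results."""
    splits = plan_splits(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_count_by_carrier, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Invented sample rows: (carrier code, delay in minutes).
rows = [("AA", 5), ("UA", 12), ("AA", 0), ("DL", 7), ("UA", 3), ("AA", 20)]
print(run_query(rows))
```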


Presto architecture

Presto (Why is it so fast?)

In-memory parallel queries

Pipeline task execution

Data local computation with multi-threading

Cache hot queries and data

Just-in-time compile-to-bytecode operators

SQL optimization

Other optimizations (e.g., Predicate Pushdown)
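Pipelined, in-memory execution means each operator hands rows to the next as they are produced, rather than writing a full intermediate result between stages. A toy sketch of that idea using Python generators follows; it is an analogy only, not how Presto is built, and the operator names and sample rows are invented for illustration.

```python
def scan(rows):
    """Source operator: emit rows one at a time."""
    for row in rows:
        yield row


def filter_delayed(rows, min_delay):
    """Filter operator: pass through rows as soon as they arrive."""
    for carrier, delay in rows:
        if delay >= min_delay:
            yield carrier, delay


def average_delay(rows):
    """Aggregation operator: keeps only a running total in memory."""
    total = count = 0
    for _carrier, delay in rows:
        total += delay
        count += 1
    return total / count if count else None


# Invented sample rows: (carrier code, delay in minutes).
rows = [("AA", 5), ("UA", 12), ("AA", 0), ("DL", 7), ("UA", 30)]

# The stages are chained lazily: no stage materializes its full output.
result = average_delay(filter_delayed(scan(rows), min_delay=5))
print(result)  # 13.5
```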

Presto: in-memory processing and pipelining

Presto: accessing large-scale datasets in S3

Any table known to the Hive Metastore can be accessed/queried by Presto.

This includes data in S3 exposed via CREATE EXTERNAL TABLE statements in Hive.
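As an illustration, a query against the airdelays table could be issued through the Presto CLI using the hive catalog. The sketch below only builds the SQL text and the command line. The query itself, the year value, and the `default` schema are assumptions made for the example; only the table and column names come from the table definition shown earlier.

```python
import shlex


def flights_per_carrier_sql(year, limit=10):
    """Build a query counting flights per carrier for one year partition."""
    if not isinstance(year, int) or not isinstance(limit, int):
        raise TypeError("year and limit must be integers")
    return (
        "SELECT uniquecarrier, count(*) AS flights "
        "FROM airdelays "
        # year is the STRING partition column, so compare against a string.
        f"WHERE year = '{year}' "
        "GROUP BY uniquecarrier "
        "ORDER BY flights DESC "
        f"LIMIT {limit}"
    )


def presto_cli_command(sql, catalog="hive", schema="default"):
    """Build a presto-cli invocation that runs one statement and exits."""
    return ["presto-cli", "--catalog", catalog, "--schema", schema,
            "--execute", sql]


sql = flights_per_carrier_sql(2008)  # hypothetical year, for illustration
command = presto_cli_command(sql)
print(shlex.join(command))
```

Filtering on the partition column lets the engine read only that year's data rather than the whole dataset.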

Hive Metastore deployment modes

Embedded Mode

• Uses Derby

• Not recommended for production

Local Mode

• Metastore service in same process as main HiveServer process

• Metastore DB runs in separate process

Remote Mode

• Metastore service runs in own JVM process

• Processes communicate with it via Thrift network API

• Metastore service communicates with Metastore DB over JDBC

http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html


Supported data sources

Currently Presto provides connectors for the following:

Hive

Cassandra

MySQL

PostgreSQL

Kafka

Redis

Common use cases

When to use Presto?

• Need fast interactive query ability with high concurrency

• Need ANSI SQL

When might you not want to use Presto?

• You focus on batch processing (ETL, enriching, aggregation, etc.) for large data sets. Hive or Spark recommended.

• Need to compute (e.g., machine learning, graph algorithms) over the retrieved data. Spark recommended.

• Star-schema organization of data. Amazon Redshift data warehouse recommended.

Airpal – a Presto GUI designed & open-sourced by Airbnb

Optional access controls for users

Search and find tables

See metadata, partitions, schemas & sample rows

Write queries in an easy-to-read editor

Submit queries through a web interface

Track query progress

Get the results back through the browser as a CSV

Create new Hive table based on the results of a query

Save queries once written

Searchable history of all queries run within the tool

Demo: Let’s now query 29 years’ worth of air-traffic data in S3 using Presto on our EMR cluster…

Summary for Presto on EMR with data in S3

Data in S3 is queryable using Presto on Amazon EMR

Presto is easy to deploy on Amazon EMR

Presto provides fast ad-hoc queries

Supports wide range of data sources

In-memory data processing with pipelining

Feature-rich

Increasing adoption & active community


Remember to complete your evaluations!

Reference

http://www.slideshare.net/GuorongLIANG/facebook-presto-presentation

https://prestodb.io

https://github.com/airbnb/airpal#airpal

https://github.com/treasure-data/prestogres

If you want to run this demo later in your own AWS account, go to: http://bit.ly/1Xg0111





Thank you!
