![Page 1: Serverless Analytics and ETL on AWS presentation- AWS … · Serverless Analytics and ETL on AWS Daniel Haviv Analytics Specialist Solutions Architect AWS dhaviv@amazon.com. Data](https://reader031.vdocuments.site/reader031/viewer/2022041013/5ec462b2754feb7aad16f82e/html5/thumbnails/1.jpg)
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Serverless Analytics and ETL on AWS
Daniel Haviv
Analytics Specialist Solutions Architect
AWS
Data Lake
• Central repository for both structured and unstructured data
• High in capacity
• Cheap
• Accessible (API, CLI)
• Wide range of integrated tools
Data Lake - HDFS
• HDFS is a good candidate, but it has its limitations:
  • High maintenance overhead (thousands of servers, tens of thousands of disks)
  • Not cheap (3 copies per file)
  • Usually serves one Hadoop cluster
Why Amazon S3 for the Data Lake?
• Durable
  • Designed for 11 9s of durability
• Available
  • Designed for 99.99% availability
• High performance
  • Multiple upload
  • Range GET
• Scalable
  • Store as much as you need
  • Scale storage and compute independently
  • No minimum usage commitments
• Integrated
  • Amazon Redshift / Spectrum
  • Amazon EMR
  • Amazon Athena
  • AWS Lambda
• Easy to use
  • Simple REST API
  • AWS SDKs
  • Read-after-create consistency
  • Event notification
  • Lifecycle policies
Athena
Challenges Customers Faced
• Significant amount of work required to analyze data in Amazon S3
• Users often only have access to aggregated data sets
• Managing a Hadoop cluster or data warehouse requires expertise
Introducing Amazon Athena
• Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
A Sample Pipeline
Ad-hoc access to raw data using SQL
Ad-hoc access to data using Athena
Athena can query aggregated datasets as well
Amazon Confidential
Athena is Serverless
• No infrastructure or administration
• Zero spin-up time
• Transparent upgrades
• Highly available
• You connect to a service endpoint or log into the console
Amazon Athena is Easy To Use
• Log into the Console
• Create a table
  • Type in a Hive DDL statement
  • Use the console Add Table wizard
• Start querying
Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
  • Text, CSV, JSON, weblogs, AWS service logs
  • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
Use ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested queries & window functions
• Support for complex data types (arrays, structs)
• Support for partitioning of data by any key (date, time, custom keys)
  • e.g., Year, Month, Day, Hour or Customer Key, Date
Amazon Athena is Cost Effective
• Pay per query
• $5 per TB scanned from S3
• DDL Queries and failed queries are free
• Save by using compression, columnar formats, partitions
Who is Athena for?
• Anyone looking to process data stored in Amazon S3
  • Data coming from IoT devices, Apache logs, Omniture logs, CF logs, application logs
• Anyone who knows SQL
  • Both developers and analysts
• Ad-hoc exploration of data and data discovery
• Customers looking to build a data lake on Amazon S3
Accessing Amazon Athena
• Through the console
• Via the AWS API
• Using JDBC/ODBC clients (either a plain SQL client or BI tools)
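The API route can be sketched in Python with boto3. This is only an illustration under stated assumptions: the database name, query text, and result location are placeholders supplied by the caller, and the client is injectable so the function can be exercised without AWS credentials.

```python
def submit_athena_query(sql, database, output_location, athena_client=None):
    """Submit a query to Athena and return its execution id.

    `database` and `output_location` are caller-supplied placeholders;
    none of these values come from the slides.
    """
    if athena_client is None:
        # Imported lazily so the sketch can be tried with a stub client.
        import boto3
        athena_client = boto3.client("athena")

    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Athena runs the query asynchronously, so a caller would typically poll `get_query_execution` with the returned id before reading results.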
Creating Tables and Querying Data
Example
CREATE EXTERNAL TABLE access_logs
(
  ip_address       STRING,
  request_time     TIMESTAMP,
  request_method   STRING,
  request_path     STRING,
  request_protocol STRING,
  response_code    STRING,
  response_size    STRING,
  referrer_host    STRING,
  user_agent       STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/'
EXTERNAL = the table is only a definition that points at the data.
When you delete the table, the data is not deleted.
LOCATION = where the data is stored.
In Athena this must be in Amazon S3.
PARTITIONED BY = partitioning allows you to limit how much data your query runs on.
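As a rough sketch of how the partition columns get used, the helpers below build the Hive DDL that registers one day's partition and a query that filters on it. The `year=/month=/day=` folder layout under the table location and the example query are assumptions made for illustration; the inputs are treated as trusted values rather than user input.

```python
def add_partition_ddl(table, year, month, day, base_location):
    """Build Hive DDL that registers one day's partition for `table`."""
    # Assumed layout: <base_location>/year=YYYY/month=MM/day=DD/
    location = f"{base_location.rstrip('/')}/year={year}/month={month}/day={day}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}') "
        f"LOCATION '{location}'"
    )


def daily_status_query(table, year, month, day):
    """Build a query whose WHERE clause restricts the scan to one partition."""
    return (
        f"SELECT response_code, count(*) AS hits FROM {table} "
        f"WHERE year='{year}' AND month='{month}' AND day='{day}' "
        f"GROUP BY response_code"
    )


print(add_partition_ddl("access_logs", "2017", "01", "15",
                        "s3://YOUR-S3-BUCKET-NAME/access-log-processed/"))
print(daily_status_query("access_logs", "2017", "01", "15"))
```

Because the filter names only partition columns, the query should read the files under a single day's prefix instead of the whole table location.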
Pay By the Query - $5/TB Scanned
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to Columnar format
• Use partitioning
| Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost |
|---|---|---|---|---|
| Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75 |
| Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013 |
| Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper |
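The cost column follows directly from the $5 per TB scanned price. A minimal sketch of the arithmetic, assuming binary units (1 TB = 1024 GB) and ignoring any per-query minimums or rounding rules:

```python
PRICE_PER_TB_USD = 5.0  # price quoted on the slide
GB_PER_TB = 1024        # assumption: binary units


def query_cost_usd(gb_scanned):
    """Cost of one query given the amount of data it scanned, in GB."""
    return gb_scanned / GB_PER_TB * PRICE_PER_TB_USD


text_cost = query_cost_usd(1.15 * GB_PER_TB)  # text logs: 1.15 TB scanned
parquet_cost = query_cost_usd(2.69)           # Parquet logs: 2.69 GB scanned
savings = 1 - parquet_cost / text_cost        # fraction of cost avoided

print(f"text: ${text_cost:.2f}  parquet: ${parquet_cost:.3f}")
```

The two costs come out at about $5.75 and $0.013, matching the table; the saving comes entirely from scanning less data.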
Demo
Glue
Why would AWS get into the ETL space?
We have lots of ETL partners
Amazon Redshift Partner Page for Data Integration
Fivetran
The problem is:
70% of ETL jobs are hand-coded, with no use of ETL tools.
Actually…
It’s over 90% in the cloud
Why do we see so much hand-coding?
• Code is flexible
• Code is powerful
• You can unit test
• You can deploy with other code
• You know your dev tools
AWS Glue automates the undifferentiated heavy lifting of ETL
• Discover: Automatically discover and categorize your data, making it immediately searchable and queryable across data sources
• Develop: Generate code to clean, enrich, and reliably move data between various data sources; you can also use your favorite tools to build ETL jobs
• Deploy: Run your jobs on a serverless, fully managed, scale-out environment. No compute resources to provision or manage.
AWS Glue: Components
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
• Run jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring and alerting
Job Authoring
• Auto-generates ETL code
• Built on open frameworks: Python and Spark
• Developer-centric: editing, debugging, sharing
Main components of AWS Glue
AWS Glue Data Catalog
Discover and organize your data
Glue Data Catalog
Manage table metadata through a Hive metastore API or Hive SQL.
Supported by tools like Hive, Presto, Spark, etc.
We added a few extensions:
• Search over metadata for data discovery
• Connection info: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
Data Catalog: Crawlers
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields
Data Catalog: Version control
List of table versions
Compare schema versions
Data Catalog: Detecting partitions
Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution…
[Figure: an S3 bucket hierarchy (month=Nov, then date=10 … date=15, then file 1 … file N under each date), annotated with similarity scores sim=.99, sim=.95 and sim=.93, mapped to a table definition.]
| Column | Type |
|---|---|
| month | str |
| date | str |
| col 1 | int |
| col 2 | float |
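The `name=value` folder convention shown in the bucket hierarchy can be illustrated with a few lines of Python. This is not the crawler's algorithm (which, per the slide, also estimates schema similarity between files); it only shows how partition columns can be read off an object key. The function name and example keys are hypothetical.

```python
def partitions_from_key(key):
    """Return (partition_values, file_name) for a Hive-style S3 object key."""
    *folders, file_name = key.strip("/").split("/")
    partitions = {}
    for folder in folders:
        # Only folders written as name=value contribute a partition column.
        if "=" in folder:
            name, value = folder.split("=", 1)
            partitions[name] = value
    return partitions, file_name


print(partitions_from_key("month=Nov/date=10/file1.json"))
```

Folders that are not written as `name=value` are simply ignored in this sketch, so a key such as `logs/month=Nov/date=15/fileN` yields the same two partition columns.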
Data Catalog: Automatic partition detection
Automatically register available partitions
[Figure: table partitions]
Job authoring in AWS Glue
You have choices on how to get started:
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue
Job authoring: Automatic code generation
1. Customize the mappings
2. Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
Job authoring: ETL code
Human-readable, editable, and portable PySpark code
• Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
• Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
• Collaborative: share code snippets via GitHub, reuse code across jobs
Job Authoring: Glue Dynamic Frames
[Figure: dynamic frame schema with fields A, B1, B2, C, an array D [ ], and nested fields X and Y.]
Like Spark’s Data Frames, but better for:
• Cleaning and (re)-structuring semi-structured data sets, e.g. JSON, Avro, Apache logs ...
No upfront schema needed:
• Infers schema on-the-fly, enabling transformations in a single pass
Easy to handle the unexpected:
• Tracks new fields, and inconsistent changing data types with choices, e.g. integer or string
• Automatically mark and separate error records
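The idea of inferring a schema in one pass and recording a "choice" when a field shows up with more than one type can be sketched in plain Python. This is a toy model for illustration only, not Glue's DynamicFrame implementation; the records and function names are made up.

```python
from collections import defaultdict


def infer_schema(records):
    """Map each field name to the set of Python type names seen for it."""
    seen = defaultdict(set)
    for record in records:  # single pass over the data
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return dict(seen)


def choice_fields(schema):
    """Fields observed with more than one type, i.e. candidates for a choice."""
    return sorted(field for field, types in schema.items() if len(types) > 1)


records = [
    {"id": 1, "zip": 98101},
    {"id": 2, "zip": "98101-1234"},           # same field, different type
    {"id": 3, "zip": 10001, "tier": "gold"},  # new field appears mid-stream
]
schema = infer_schema(records)
print(schema)
print(choice_fields(schema))
```

Here `zip` is seen as both an integer and a string, so it is the one field that would need resolving before loading into a strictly typed target.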
Job Authoring: Glue transforms
Adaptive and flexible
[Figure: ResolveChoice() applied to a choice-typed column B, with three outcomes: project, cast, or separate into cols. ApplyMapping() shown on fields A, X, Y and C.]
Job authoring: Relationalize() transform
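[Figure: a semi-structured schema (A, B, B, C with nested X and Y, array D [ ]) mapped to a relational schema with columns A, B, B, C.X, C.Y and FK, plus a second table with PK, Offset and Value.]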
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
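A simplified picture of what such a flattening does, written in plain Python rather than with Glue's `Relationalize` transform: nested structs become dotted column names, and an array becomes rows in a child table linked by a key. The function, the column names `id`, `offset` and `value`, and the single level of nesting are all assumptions of this sketch.

```python
def relationalize(records, array_field):
    """Split records into a root table and a child table for one array field.

    Nested dicts are flattened to dotted column names (C.X, C.Y); the array
    is replaced by a key that points at rows in the child table.
    """
    root_rows, child_rows = [], []
    for key, record in enumerate(records, start=1):
        row = {}
        for field, value in record.items():
            if field == array_field:
                row[field] = key  # foreign key into the child table
                for offset, item in enumerate(value):
                    child_rows.append({"id": key, "offset": offset, "value": item})
            elif isinstance(value, dict):
                for sub_field, sub_value in value.items():
                    row[f"{field}.{sub_field}"] = sub_value
            else:
                row[field] = value
        root_rows.append(row)
    return root_rows, child_rows


root, child = relationalize(
    [{"A": 1, "C": {"X": "x", "Y": "y"}, "D": [10, 20]}], array_field="D"
)
print(root)
print(child)
```

Both outputs are flat lists of rows, which is the shape a relational target such as a data warehouse expects.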
Job authoring: Glue transformations
Prebuilt transformation: Click and
add to your job with simple
configuration
Spigot writes sample data from
DynamicFrame to S3 in JSON format
Expanding… more transformations
to come
Job authoring: Write your own scripts
Import custom libraries required by your code
Convert to a Spark Data Frame
for complex SQL-based ETL
Convert back to Glue Dynamic Frame
for semi-structured processing and
AWS Glue connectors
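The round trip described above might look like the following inside a Glue script. It is a sketch rather than generated Glue code: the helper name, the temporary view name, and the frame name `sql_result` are made up, and the function expects a `GlueContext` and a `DynamicFrame` that the surrounding job has already created.

```python
def run_sql_step(glue_context, dynamic_frame, sql, view_name="records"):
    """Run Spark SQL over a DynamicFrame and hand back a DynamicFrame."""
    # awsglue is only present in the Glue runtime (or via aws-glue-libs),
    # so it is imported inside the function.
    from awsglue.dynamicframe import DynamicFrame

    data_frame = dynamic_frame.toDF()              # DynamicFrame -> Spark DataFrame
    data_frame.createOrReplaceTempView(view_name)  # expose it to Spark SQL
    result = glue_context.spark_session.sql(sql)   # the SQL-based ETL step
    # Convert back so later steps can use Glue transforms and connectors.
    return DynamicFrame.fromDF(result, glue_context, "sql_result")
```

The SQL text refers to the data through `view_name`, for example `SELECT request_path, count(*) FROM records GROUP BY request_path`.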
Job authoring: Development endpoints
Environment to iteratively develop and test ETL code.
Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
When you are satisfied with the results you can create an ETL job that runs your code.
[Figure: a remote interpreter connected to an interpreter server in the Glue Spark environment.]
Job Authoring: Leveraging the community
No need to start from scratch.
Use Glue samples stored in GitHub to share, reuse, and
contribute: https://github.com/awslabs/aws-glue-samples
• Migration scripts to import existing Hive Metastore data
into AWS Glue Data Catalog
• Examples of how to use Dynamic Frames and
Relationalize() transform
• Examples of how to use arbitrary PySpark code with
Glue’s Python ETL library
Download Glue’s Python ETL library to start developing code
in your IDE: https://github.com/awslabs/aws-glue-libs
Orchestration and resource management
Fully managed, serverless job execution
Job execution: Scheduling and monitoring
Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
• Schedule-based: e.g., time of day
• Event-based: e.g., job completion
• On-demand: e.g., AWS Lambda
• More coming soon: Data Catalog based events, S3 notifications and Amazon CloudWatch events
Logs and alerts are available in Amazon CloudWatch
[Figure: example job graph. "Marketing: Ad-spend by customer segment" (event based, Lambda trigger) and "Sales: Revenue by customer segment" (schedule, weekly sales) feed "Central: ROI by customer segment" through data-based triggers.]
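The on-demand path through AWS Lambda could be sketched as a handler that starts a Glue job run with boto3. The job name is hypothetical, and the optional `glue_client` parameter exists only so the handler can be tried with a stub instead of a real AWS client.

```python
JOB_NAME = "ad-spend-by-segment"  # hypothetical Glue job name


def lambda_handler(event, context, glue_client=None):
    """Start one run of the Glue job whenever the function is invoked."""
    if glue_client is None:
        # Imported lazily so the sketch can be tried with a stub client.
        import boto3
        glue_client = boto3.client("glue")

    response = glue_client.start_job_run(JobName=JOB_NAME)
    return {"job_run_id": response["JobRunId"]}
```

Lambda calls the handler with only `event` and `context`, so the default client is used in a real deployment; the function's execution role would also need permission to start the job.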
Job execution: Job bookmarks
Glue keeps track of data that has already been processed by a previous run of an
ETL job. This persisted state information is called a bookmark.
For example, you get new files every day in your S3 bucket. By default, AWS Glue
keeps track of which files have been successfully processed by the job to
prevent data duplication.
| Option | Behavior |
|---|---|
| Enable | Pick up from where you left off |
| Disable | Ignore and process the entire dataset every time |
| Pause | Temporarily disable advancing the bookmark |
[Figure: the "Marketing: Ad-spend by customer segment" job reading data objects.]
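The three bookmark options can be modelled with a small simulation. This is a deliberately simplified picture in which the bookmark is just the set of object keys already processed; it is an assumption of the sketch and not how Glue stores bookmark state.

```python
def run_job(available_keys, bookmark, mode="enable"):
    """Return (keys_to_process, new_bookmark) for one simulated job run.

    `bookmark` is the set of keys handled by earlier runs.
    """
    if mode == "disable":
        # Ignore the bookmark and process the entire dataset every time.
        return sorted(available_keys), set(bookmark)
    to_process = sorted(k for k in available_keys if k not in bookmark)
    if mode == "pause":
        # Process only new data, but do not advance the bookmark.
        return to_process, set(bookmark)
    # Enable: pick up from where the previous run left off.
    return to_process, set(bookmark) | set(to_process)


day1 = ["2017-11-10/a.json", "2017-11-10/b.json"]
processed, bookmark = run_job(day1, set())
day2 = day1 + ["2017-11-11/c.json"]
print(run_job(day2, bookmark)[0])             # only the new file
print(run_job(day2, bookmark, "disable")[0])  # everything again
```

With the bookmark enabled, the second run sees only the file that arrived after the first run, which is what prevents duplicated output.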
Job execution: Serverless
• Auto-configure VPC and role-based access
• Customers can specify the capacity that gets allocated to each job
• Automatically scale resources (on post-GA roadmap)
• You pay only for the resources you consume while consuming them
There is no need to provision, configure, or manage servers
[Figure: compute instances connected to customer VPCs.]
Common use cases for AWS Glue
Understand your data assets
Instantly query your data lake on Amazon S3
ETL data into your data warehouse
Build event-driven ETL pipelines
Thank you.