Ingeniería de Sistemas Data Analytics Prof: Hugo Franco Session N° 06 | ETL Process ETL Tools available in the market Technical issues - Features – Use cases Bogotá D.C., August 30, 2021


Page 1: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Ingeniería de Sistemas

Data Analytics

Prof: Hugo Franco

Session N° 06 | ETL Process

ETL Tools available in the market

Technical issues - Features – Use cases

Bogotá D.C., August 30, 2021

Page 2: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Review: ETL tools within Data Analytics frameworks

Page 3: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL Tools: selection criteria

• Functional criteria:
  • Variety of source and target data formats and bindings
  • Data transformation processes (sort, filter and aggregate)
  • Data cleansing, data quality and data governance functionality

• Interoperability:
  • Continuous Integration and Continuous Delivery functionality
  • Transparency between on-premises and cloud-based scenarios
  • Modularity (changing providers for each part/step/component of the ETL process)
  • Multicloud features

• Integration and Scalability:
  • Adaptation to new technologies and/or implementations
  • Horizontal/Vertical scaling features
  • Portability between technologies, platforms and infrastructure

* https://www.talend.com/es/resources/etl-tools/
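The functional criteria above (sort, filter and aggregate between a source format and a target binding) can be illustrated with a minimal sketch using only the Python standard library. The CSV content, the `sales_by_region` table and all function names are hypothetical, chosen for illustration; a commercial ETL tool would expose the same steps through connectors and visual components.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical source data; a real job would read from a file, API or database.
RAW_CSV = """region,amount
north,120.5
south,80.0
north,30.0
south,-5.0
east,55.25
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: filter invalid rows, aggregate per region, sort by total."""
    totals = defaultdict(float)
    for row in rows:
        amount = float(row["amount"])
        if amount <= 0:                    # filter: drop non-positive amounts
            continue
        totals[row["region"]] += amount    # aggregate: sum per region
    # sort: largest total first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def load(records, connection):
    """Load: write the transformed records into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT, total REAL)"
    )
    connection.executemany("INSERT INTO sales_by_region VALUES (?, ?)", records)
    connection.commit()

def run_etl():
    """Run the three steps against an in-memory SQLite target."""
    connection = sqlite3.connect(":memory:")
    load(transform(extract(RAW_CSV)), connection)
    query = "SELECT region, total FROM sales_by_region ORDER BY total DESC"
    return connection.execute(query).fetchall()

print(run_etl())
```

Keeping extract, transform and load as separate functions mirrors the modularity criterion: each step could, in principle, be swapped for another provider without touching the others.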

Page 4: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Commercial ETL software at different layers

https://www.talend.com/es/resources/etl-tools/

https://streamsets.com/

* Hortonworks has since merged with Cloudera

Page 5: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL software: GUI examples

https://www.talend.com/blog/2018/06/25/why-data-scientists-love-python-and-how-to-use-it-with-talend/

https://streamsets.com/blog/automating-pipeline-development-with-the-streamsets-sdk-for-python/

* Several frameworks provide Python bindings (APIs) and/or implement their functionality using Python

Page 6: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL Tools market overview

Page 7: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Popular Sources and Targets of ETL processes in real use-cases *

* Retrieved from https://striim.com

Page 8: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Some relevant ETL tools and frameworks

• Talend Data Integration

• Oracle Data Integrator

• Xplenty

• Informatica PowerCenter

• Stitch

• FlyData

• Fivetran

• AWS Glue

• Pentaho

• Striim

• Panoply

• Hevo Data

• Matillion

Page 9: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Talend Data Integration

• Open-source ETL data integration solution.
• Talend’s proprietary, paid Data Management Platform adds tools and features for design, productivity, management, monitoring, and data governance.

Pros:
• Compatible with data sources both on-premises and in the cloud
• Hundreds of pre-built integrations.

Cons:
• Several interesting features are restricted to the paid version.

Use cases:
• Companies preferring open-source solutions
• Companies requiring multiple (or complex) pre-built data integrations.

* In Big Data frameworks, many operations follow “lazy” strategies, i.e., operations are first defined (planned) on metadata and only then executed on the actual data
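The “lazy” strategy in the note above can be sketched in a few lines of plain Python. The `LazyPipeline` class and its method names are hypothetical and are not the API of any particular framework; the sketch only shows a plan being recorded first and executed later.

```python
class LazyPipeline:
    """Records operations as a plan; nothing touches the data until collect()."""

    def __init__(self, source, plan=()):
        self._source = source       # any iterable of records
        self._plan = tuple(plan)    # ordered (kind, function) steps

    def map(self, fn):
        """Plan a per-record transformation (not executed yet)."""
        return LazyPipeline(self._source, self._plan + (("map", fn),))

    def filter(self, predicate):
        """Plan a per-record filter (not executed yet)."""
        return LazyPipeline(self._source, self._plan + (("filter", predicate),))

    def explain(self):
        """Describe the plan without reading any data."""
        return [kind for kind, _ in self._plan]

    def collect(self):
        """Execute the recorded plan on the actual data."""
        stream = iter(self._source)
        for kind, fn in self._plan:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)


calls = []  # records which values were actually processed

def double(value):
    calls.append(value)
    return value * 2

pipeline = LazyPipeline(range(5)).map(double).filter(lambda v: v > 2)
print(pipeline.explain())   # the plan is known before any record is read
print(calls)                # still empty: nothing has been executed
print(pipeline.collect())   # execution happens only here
```

Because the whole plan is known before execution, a framework has the opportunity to reorder or merge steps; this sketch does not attempt any such optimization.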

Page 10: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Oracle Data Integrator

• Part of Oracle’s data management ecosystem.
• Both on-premises and cloud versions
  • Oracle Data Integration Platform Cloud.

Pros:
• Straightforward integration with other Oracle applications (such as Hyperion Financial Management or Oracle E-Business Suite, EBS).

Cons:
• Supports ELT workloads (not ETL).
• Certain tools require different Oracle software and suites (could be expensive).
• The learning curve is steep.

Use cases:
• Companies that have already acquired other components of the Oracle ecosystem
• Companies whose data integration processes are oriented to ELT pipelines.

* Tez and MapReduce are approaches to data-oriented distributed computing
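As a rough illustration of the MapReduce approach mentioned in the note above, the following single-process sketch counts words in three phases (map, shuffle, reduce). All function names are hypothetical; a real framework would distribute these phases across many machines, which this sketch does not attempt.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))

print(word_count(["ETL tools", "etl process", "ELT tools"]))
```

The map and reduce functions are independent per document and per key respectively, which is what makes the approach suitable for distribution.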

Page 11: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Xplenty

• Cloud-based, paid data integration platform for ETL and ELT (extract, load, transform)
• Oriented to visual interfaces for building data pipelines across multiple sources and destinations.
• Includes connectivity to MongoDB, MySQL, PostgreSQL, Amazon Redshift, Google Cloud Platform, Facebook, Salesforce, etc.

Pros:
• Supports scalability and security configuration for several scenarios
• Field Level Encryption with per-user encryption key.
• Regulatory compliance with laws like HIPAA, GDPR, and CCPA.

Cons:
• Rigid per-year billing; could be expensive for small organizations.

Use cases:
• Companies using both ETL and ELT workloads
• Companies that prefer visual interfaces
• Companies requiring multiple pre-built integrations
• Companies requiring strong data security features.

Page 12: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Informatica PowerCenter

• Enterprise data integration platform for ETL workloads.
• PowerCenter is part of the Informatica cloud data management tools suite.
• Enterprise-class, database-neutral solution including cloud data management tools

Pros:
• High performance and high compatibility

Cons:
• Could be expensive; the payment plans are complex
• Steep learning curve

Use cases:
• Large enterprises with robust budgets invested in solving big data problems and/or heavy analytic workloads, or demanding high-performance features.

Page 13: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

AWS Glue

• Fully managed ETL service from Amazon Web Services
• Designed for big data analytics.
• Job scheduling
• “Development endpoints” for testing scripts.

Pros:
• Direct integration with other services and tools in the AWS ecosystem.
• Serverless: AWS automatically provisions the compute resources for a job and releases them when the workload is complete.

Cons:
• Could be less flexible than other tools, and typically best suited to users who are already within the AWS ecosystem.

Use cases:
• Companies already using AWS for their data analytics needs.
• Companies requiring fully managed ETL approaches.

Page 14: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Pentaho (Kettle)

• Open-source platform offered by Hitachi Vantara, used for both data integration and analytics.
• There is a free community edition along with a commercial license for the software’s enterprise edition.

Pros:
• User-friendly interface for building “robust” ETL pipelines.

Cons:
• Some users report poor documentation, especially for error detection and management.

Use cases:
• Companies oriented to open-source based tools and frameworks, including those oriented to ETL processes.

Page 15: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Striim

• Real-time data integration for big data.
• Multiple sources and targets of different types.
• Support for about 20 different data formats and systems (Oracle, SQL Server, MySQL, PostgreSQL, MongoDB, Hadoop, etc.).

Pros:
• Compliant with data privacy regulations such as GDPR and HIPAA.
• Pre-load transformations using SQL or Java.

Cons:
• Does not include SaaS sources or targets.
• Does not allow adding new data sources.
• Small community

Use cases:
• Companies requiring GDPR or HIPAA compliance.
• Companies that use a fixed set of data sources and do not require SaaS interoperability.
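The idea of a pre-load transformation on a real-time stream can be sketched with Python generators. This is only an illustration of the concept: the slide states that Striim expresses such transformations in SQL or Java, and the event fields and function names below are hypothetical.

```python
def event_source():
    """Hypothetical stream of events; stands in for a real-time source."""
    yield {"user": "alice", "email": "Alice@Example.com", "amount": "10.50"}
    yield {"user": "bob", "email": "bob@example.com", "amount": "oops"}
    yield {"user": "carol", "email": "CAROL@EXAMPLE.COM", "amount": "7"}

def transform(events):
    """Clean each event in flight, before it reaches the target."""
    for event in events:
        try:
            amount = float(event["amount"])
        except ValueError:
            continue                      # drop events with an unparsable amount
        yield {
            "user": event["user"],
            "email": event["email"].lower(),   # normalize e-mail addresses
            "amount": amount,
        }

def load(events, target):
    """Append each already-transformed event to the target, one at a time."""
    for event in events:
        target.append(event)
    return target

warehouse = load(transform(event_source()), [])
print(warehouse)
```

Each event is processed as soon as it arrives rather than in a batch, so the target only ever receives cleaned records.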

Page 16: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Fivetran

• Cloud-based, data warehouse-oriented ETL platform
• Data integration with Redshift, BigQuery, Azure, and Snowflake
• About 90 possible SaaS sources
• Custom integrations are supported

Pros:
• Easy to use (management and user interfaces)
• New connectors are configured in a fast and straightforward manner (in most cases)

Cons:
• The pricing model is complex and depends on several factors.
• Some users report poor support for complex technical issues

Use cases:
• Companies requiring several pre-built integrations
• Companies using multiple data warehouses of different types

Page 17: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Stitch

• Open-source ELT data integration platform with paid service extensions for advanced use cases and larger numbers of data sources.

Pros:
• Self-service ELT and automated data pipelines.
• Simple pricing
• High performance according to several tests

Cons:
• Does not perform arbitrary transformations, since they are implemented on top of raw data after the Extraction process
• Stitch was acquired by Talend in fall 2018.
• Some technical issues are reported
• Some data sources are poorly supported

Use cases:
• Companies oriented to open-source solutions
• Companies using simple ELT processes
• Companies not requiring complex transformations

* Most ELT tools do not implement explicit transformations on raw data, since some Data Analytics processes perform them as the first step of the analysis
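The ELT ordering described in the note above can be sketched with the standard-library `sqlite3` module standing in for a data warehouse. The table, view and function names are hypothetical; the point is only that raw data is loaded untouched and the transformation is expressed later, inside the target.

```python
import sqlite3

# Hypothetical raw records, loaded exactly as extracted (untrimmed, untyped).
RAW_ROWS = [
    ("2021-08-01", " North ", "120.5"),
    ("2021-08-01", "south", "80"),
    ("2021-08-02", "NORTH", "30"),
]

def extract_and_load(connection, rows):
    """E + L: copy the raw rows into the target without transforming them."""
    connection.execute(
        "CREATE TABLE raw_sales (day TEXT, region TEXT, amount TEXT)"
    )
    connection.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)

def transform_in_target(connection):
    """T: define the cleaning and aggregation as SQL inside the target."""
    connection.execute("""
        CREATE VIEW sales_by_region AS
        SELECT lower(trim(region)) AS region,
               SUM(CAST(amount AS REAL)) AS total
        FROM raw_sales
        GROUP BY lower(trim(region))
    """)

def run_elt():
    """Load first, transform afterwards, then query the transformed view."""
    connection = sqlite3.connect(":memory:")
    extract_and_load(connection, RAW_ROWS)
    transform_in_target(connection)
    query = "SELECT region, total FROM sales_by_region ORDER BY region"
    return connection, connection.execute(query).fetchall()

connection, totals = run_elt()
print(totals)
```

Since the raw table is preserved, the analysis step can redefine the view at any time without re-extracting the data, which is the trade-off the note points to.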