Systems Engineering - Data Analytics
Prof: Hugo Franco
Session N° 06 | ETL Process
ETL Tools available in the market
Technical issues - Features - Use cases
Bogotá D.C., August 30, 2021
Review: ETL tools within Data Analytics frameworks
ETL Tools: selection criteria
Functional criteria:
• Variety of source and target data formats and bindings
• Data transformation processes (sort, filter and aggregate)
• Data cleansing, data quality and data governance functionality
Interoperability:
• Continuous Integration and Continuous Delivery functionality
• Transparency between on-premises and cloud-based scenarios
• Modularity (switching providers for each part/step/component of the ETL process)
• Multicloud features
Integration and Scalability:
• Adaptation to new technologies and/or implementations
• Horizontal/Vertical scaling features
• Portability between technologies, platforms and infrastructure
* https://www.talend.com/es/resources/etl-tools/
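The transformation criterion above (sort, filter and aggregate) can be illustrated with a minimal, tool-independent sketch. The records, field names and the `transform` function below are hypothetical and only serve as an example; commercial ETL tools expose the same three operations through their own components.

```python
from collections import defaultdict

# Hypothetical extracted records; field names are assumptions for illustration.
orders = [
    {"customer": "ana", "amount": 120.0, "status": "paid"},
    {"customer": "luis", "amount": 80.0, "status": "cancelled"},
    {"customer": "ana", "amount": 30.0, "status": "paid"},
    {"customer": "marta", "amount": 200.0, "status": "paid"},
]


def transform(records):
    """Filter paid orders, aggregate amounts per customer, sort by total."""
    # Filter: keep only the records that satisfy a condition.
    paid = [r for r in records if r["status"] == "paid"]

    # Aggregate: sum the amount per customer.
    totals = defaultdict(float)
    for r in paid:
        totals[r["customer"]] += r["amount"]

    # Sort: order customers by total amount, largest first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


result = transform(orders)
print(result)
```

In a real pipeline the input would come from the Extract step and the result would be handed to the Load step; here both are plain Python lists to keep the sketch self-contained.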
Commercial ETL software at different layers
https://www.talend.com/es/resources/etl-tools/
https://streamsets.com/
* Hortonworks has since merged with Cloudera
ETL software: GUI examples
https://www.talend.com/blog/2018/06/25/why-data-scientists-love-python-and-how-to-use-it-with-talend/
https://streamsets.com/blog/automating-pipeline-development-with-the-streamsets-sdk-for-python/
* Several frameworks provide Python bindings (APIs) and/or implement their functionality using Python
ETL Tools market overview
Popular Sources and Targets of ETL processes in real use cases *
* Retrieved from https://striim.com
Some relevant ETL tools and frameworks
• Talend Data Integration
• Oracle Data Integrator
• Xplenty
• Informatica PowerCenter
• Stitch
• FlyData
• Fivetran
• AWS Glue
• Pentaho
• Striim
• Panoply
• Hevo Data
• Matillion
Talend Data Integration
• Open-source ETL data integration solution.
• Talend’s proprietary, paid Data Management Platform adds tools and features for design, productivity, management, monitoring, and data governance.
Pros:
• Compatible with data sources both on-premises and in the cloud
• Hundreds of pre-built integrations.
Cons:
• Several interesting features are restricted to the paid version.
Use cases:
• Companies preferring open-source solutions
• Companies requiring multiple (or complex) pre-built data integrations.
* In Big Data frameworks, many operations follow “lazy” strategies, i.e., operations are first defined (planned) on metadata and only then executed on the actual data
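The “lazy” strategy in the note above can be sketched with plain Python generators. This is only an analogy under stated assumptions: the function names (`read_rows`, `keep_even`, `double`) are hypothetical, and real Big Data frameworks plan and optimize operations in far more elaborate ways. The point shown is that building the pipeline touches no data; rows are read only when the result is requested.

```python
def read_rows(rows, log):
    """Yield rows one by one, recording each read in `log`."""
    for row in rows:
        log.append(f"read {row}")
        yield row


def keep_even(rows):
    # Planned filter: nothing is evaluated until the result is consumed.
    return (r for r in rows if r % 2 == 0)


def double(rows):
    # Planned map: also deferred.
    return (r * 2 for r in rows)


log = []

# Definition (planning) phase: the pipeline is only described.
plan = double(keep_even(read_rows([1, 2, 3, 4], log)))
planned_reads = len(log)  # no row has been read at this point

# Execution phase: consuming the plan triggers the actual reads.
result = list(plan)

print(planned_reads, result, len(log))
```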
Oracle Data Integrator
• Part of Oracle’s data management ecosystem.
• Both on-premises and cloud versions
• Oracle Data Integration Platform Cloud.
Pros:
• Simple integration with other Oracle applications (such as Hyperion Financial Management or Oracle E-Business Suite, EBS).
Cons:
• Supports ELT workloads (not ETL).
• Certain tools require different Oracle software and suites (could be expensive).
• The learning curve is steep.
Use cases:
• Companies that have already acquired other components of the Oracle ecosystem
• Companies whose data integration processes are oriented to ELT pipelines.
* Tez and Map-Reduce are approaches to data-oriented distributed computing
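As a rough illustration of the Map-Reduce approach mentioned in the note above, the following single-process sketch counts words in three phases (map, shuffle, reduce). It is not Tez or Hadoop code: in a real cluster each phase would run in parallel across nodes, and the function names used here are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["ETL loads data", "ELT loads raw data"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(counts)
```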
Xplenty
• Cloud-based ETL and ELT (extract, load, transform) paid data integration platform
• Oriented to building data pipelines through visual interfaces, using multiple sources and destinations.
• Includes connectivity to MongoDB, MySQL, PostgreSQL, Amazon Redshift, Google Cloud Platform, Facebook, Salesforce, etc.
Pros:
• Supports scalability and security configuration for several scenarios
• Field Level Encryption with per-user encryption key.
• Regulatory compliance with laws such as HIPAA, GDPR, and CCPA.
Cons:
• Rigid per-year billing. Could be expensive for small organizations.
Use cases:
• Companies using both ETL and ELT workloads
• Companies that prefer visual interfaces
• Companies requiring multiple pre-built integrations
• Companies requiring strong data security features.
Informatica PowerCenter
• Enterprise data integration platform for ETL workloads.
• PowerCenter is part of the Informatica cloud data management tools suite.
• Enterprise-class, database-neutral solution including cloud data management tools
Pros:
• High performance and high compatibility
Cons:
• Could be expensive; the payment plans are complex
• Steep learning curve
Use cases:
• Large enterprises with robust budgets that need to solve big data problems and/or heavy analytic workloads, or that demand high-performance features.
AWS Glue
• Fully managed ETL service from Amazon Web Services
• Designed for big data analytics.
• Job scheduling
• “Developer endpoints” for testing scripts.
Pros:
• Direct integration with other services and tools in the AWS ecosystem.
• Serverless: Amazon automatically provisions a server for users and shuts it down when the workload is complete.
Cons:
• Could be less flexible than other tools, and typically best suited to users who are already within the AWS ecosystem.
Use cases:
• Companies already using AWS for their data analytics needs.
• Companies requiring fully managed ETL approaches.
Pentaho (Kettle)
• Open-source platform offered by Hitachi Vantara, used for both data integration and analytics.
• There is a free community edition along with a commercial license for the software’s enterprise edition.
Pros:
• User-friendly interface that lets users build “robust” ETL pipelines.
Cons:
• Some users report poor documentation, especially for error detection and management.
Use cases:
• Companies oriented to open-source based tools and frameworks, including those oriented to ETL processes.
Striim
• Real-time data integration for big data.
• Multiple sources and targets of different types.
• About 20 different file formats (Oracle, SQL Server, MySQL, PostgreSQL, MongoDB, Hadoop, etc.).
Pros:
• Compliant with data privacy regulations such as GDPR and HIPAA.
• Pre-load transformations using SQL or Java.
Cons:
• Does not include SaaS sources or targets.
• Does not allow adding new data sources.
• Small community
Use cases:
• Companies requiring GDPR or HIPAA compliance.
• Companies that use a fixed set of data sources and do not require SaaS interoperability.
Fivetran
• Cloud-based, data warehouse-oriented ETL platform
• Data integration with Redshift, BigQuery, Azure, and Snowflake
• About 90 possible SaaS sources
• Custom integrations are supported
Pros:
• Easy to use (management and user interfaces)
• New connectors are configured in a fast and straightforward manner (in most cases)
Cons:
• The pricing model is complex and depends on several factors.
• Some users report poor support for complex technical issues
Use cases:
• Companies requiring several pre-built integrations
• Companies using multiple data warehouses of different types
Stitch
• Open-source ELT data integration platform with paid service extensions for advanced use cases and larger numbers of data sources.
Pros:
• Self-service ELT and automated data pipelines.
• Simple pricing
• High performance according to several tests
Cons:
• Does not perform arbitrary transformations itself, since in ELT they are applied on top of the raw data after the Extraction process
• Stitch was acquired by Talend in fall 2018.
• Some technical issues are reported
• Some data sources are poorly supported
Use cases:
• Companies oriented to open-source solutions
• Companies using simple ELT processes
• Companies not requiring complex transformations
* Most ELT tools do not implement explicit transformations on raw data, since some Data Analytics processes perform them as the first step of the analysis
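The ELT pattern described in the note above can be sketched with Python's built-in `sqlite3` module standing in for the target data warehouse. This is an assumption made for illustration only: the table name, columns and sample values are hypothetical, and a real ELT tool such as Stitch would load into a warehouse like Redshift or BigQuery instead. The raw data is loaded untouched, and the cleaning and aggregation happen later, inside the target, as the first step of the analysis.

```python
import sqlite3

# In-memory database used as a stand-in for the target data warehouse.
conn = sqlite3.connect(":memory:")

# Load: raw values are copied as-is (untrimmed text, amounts stored as strings).
conn.execute("CREATE TABLE raw_sales (city TEXT, amount TEXT)")
raw = [(" Bogota ", "100"), ("bogota", "50"), ("Cali", "70")]
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw)

# Transform: performed afterwards with SQL, inside the target system.
rows = conn.execute(
    """
    SELECT lower(trim(city)) AS city,
           SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY lower(trim(city))
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)
```

In an ETL pipeline the trimming, casting and aggregation would instead run before the `INSERT`, so only the cleaned result would reach the target.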