
Page 1: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Ingeniería de Sistemas
Data Analytics

Prof: Hugo Franco

Session N° 04 | ETL Process
Extraction and Transformation Details
ETL tools (properties, selection criteria)

Bogotá D.C., August 23, 2021

Page 2: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Contents

• ETL contd’• Analysis oriented Transformations.

• Extraction details• Transformation details• Tools for ETL

• Functional requirements. Interoperability, scalability, portability

Page 3: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Review

Page 4: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Context: standardized processes
CRISP-DM (1999), ASUM (IBM, 2015)

1. Evaluation of the organization's readiness and conditions for the analytical process.

2. Understanding of the organization's information needs (business model; questions and problems that guide the analytics process).

3. Understanding the data (validation of the data quality -format and content-, descriptive statistics)

4. Preparation of complete data (profiling, mapping of source systems to destination systems, ETL processes).

5. Modeling: exploration, selection and implementation of methods for the identification of patterns, trends, categories, etc.

6. Evaluation of the process.

7. Deployment in the support information system.

8. Feedback (based on the effectiveness of the models in production).

Page 5: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Extraction, Transformation and Loading (ETL)

• Understand the data contents and their quality:
  • Correctness
  • Completeness
  • Compliance
  • Relevance
  • Timeliness

• ETL rule validation

• Time estimation

(A minimal data-quality check is sketched below.)
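A minimal sketch of such data-quality checks in Python with pandas; the column names and rules below are invented for illustration and are not part of the course material:

```python
# Basic data-quality review of a hypothetical "orders" extract:
# completeness (null ratios), duplicates and a simple correctness rule.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount":   [100.0, None, 250.0, -5.0],
    "country":  ["CO", "CO", "PE", None],
})

report = {
    "rows": len(df),
    "null_ratio_per_column": df.isna().mean().to_dict(),       # completeness
    "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
    "negative_amounts": int((df["amount"] < 0).sum()),          # correctness rule
}
print(report)
```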

Page 6: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL process: Extraction

Page 7: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL: Extract

Extract pipeline stages: detect changes in the original source → acquire data from the source (access, download, filter) → staging

• Is there any change in the relevant data since the last execution?
  • Decide the strategy to process new or updated records.
• Are data updates notified by the source, or must the source be queried for them?
  • Extract vs. receive strategies
  • Publish-subscribe strategies can reduce the process complexity and enhance its stability.
• To store or not to store?
  • Plain text files are simple and easier to explore and process than binary files.

(A minimal incremental-extraction sketch follows below.)
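A minimal incremental-extraction sketch in Python, assuming the source exposes an `updated_at` column and that changed rows are staged as plain text; table, column, database and file names are illustrative assumptions:

```python
# Detect changes since the last execution and stage them as CSV.
import sqlite3
from pathlib import Path
import pandas as pd

LAST_RUN_FILE = Path("last_run.txt")
last_run = LAST_RUN_FILE.read_text().strip() if LAST_RUN_FILE.exists() else "1970-01-01 00:00:00"

conn = sqlite3.connect("source.db")   # stand-in for the real source system
changed = pd.read_sql_query(
    "SELECT * FROM orders WHERE updated_at > ?", conn, params=(last_run,)
)

if not changed.empty:
    Path("staging").mkdir(exist_ok=True)
    changed.to_csv("staging/orders_delta.csv", index=False)   # plain-text staging area

LAST_RUN_FILE.write_text(pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S"))
```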

Page 8: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Common data formats

• Generic: plain text (CSV, TSV), XML, JSON

• Adapted to DBMS drivers: ODBC, JDBC

• Distributed file system-oriented: Hadoop Distributed File System (HDFS), Apache Hive/HCatalog

• Proprietary Databases: Oracle, IBM DB2, SQL Server, Sybase

• Framework-oriented: IBM Websphere MQ, Salesforce.com, SAP/R3, Teradata, Vertica, Netezza, Greenplum, Mainframe (IBM z/OS, data warehousing)
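The generic formats above can be read directly with pandas, for example (file names are placeholders; DBMS access via ODBC/JDBC would instead go through a driver package such as pyodbc or SQLAlchemy):

```python
import pandas as pd

csv_df  = pd.read_csv("customers.csv")             # plain text, comma separated
tsv_df  = pd.read_csv("metrics.tsv", sep="\t")     # plain text, tab separated
json_df = pd.read_json("events.json", lines=True)  # one JSON object per line
xml_df  = pd.read_xml("catalog.xml")               # requires the lxml package
```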

Page 9: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL process: Transformation

Page 10: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL vs. ELT

• In ETL, data is transferred from the data source to a staging area and then into the data warehouse.

• ELT leverages the data warehouse to do basic transformations. There is no need for data staging.

• ETL can help with data privacy and compliance by cleaning sensitive and secure data even before loading into the data warehouse.

• ETL can perform sophisticated data transformations and can be more cost-effective than ELT.

* https://www.xplenty.com/blog/etl-vs-elt/

Page 11: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Transform

• How will the records be assembled?
  • Data usually come from different sources, in different formats and at different time intervals (updates).
• What are the de-duplication rules?
  • Consistent mechanisms reduce the complexity and enhance the stability of the system.
• How will defective records be evaluated and processed?
  • Detection, management and auditing are always required.
• Other transformations
  • Problem-specific transformations, key generation, conventions within the information system, etc.

Transform pipeline stages: data cleansing → duplicate elimination → problem-specific transformations → staging

(A de-duplication and defective-record handling sketch follows below.)
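A small sketch of de-duplication and defective-record handling with pandas; the matching keys and the validity rule are assumptions for illustration:

```python
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [10, 10, 11, 12],
    "email": ["a@x.co", "a@x.co", None, "c@x.co"],
    "amount": [5.0, 5.0, 7.0, -1.0],
})

# De-duplication rule: the same customer_id + email counts as one record.
deduped = raw.drop_duplicates(subset=["customer_id", "email"])

# Defective records (here: missing email or negative amount) are routed
# to a reject file for auditing instead of being silently dropped.
defective_mask = deduped["email"].isna() | (deduped["amount"] < 0)
deduped[defective_mask].to_csv("rejected_records.csv", index=False)
clean = deduped[~defective_mask]
```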

Page 12: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Transform operation types

• Transforms: Aggregation, Copy, Join, Sort, Merge, Partition, Filter, Reformat, Lookup

• Mathematical: +, -, x, /, Abs, IsValidNumber, Mod, Pow, Rand, Round, Sqrt, ToNumber, Truncate, Average, Min, Max

• Logical: And, Or, Not, IfThenElse, RegEx, Variables

• Text: Concatenate, CharacterLengthOf, LengthOf, Pad, Replace, ToLower, ToText, ToUpper, Translate, Trim, Hash

• Date: DateAdd, DateDiff, DateLastDay, DatePart, IsValidDate

• Format: ASCII, EBCDIC, Unicode
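Several of these operation types map directly onto pandas calls; a sketch on invented sample data:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B"],
    "amount": [100.0, 150.0, 80.0],
    "sold_at": pd.to_datetime(["2021-08-01", "2021-08-15", "2021-08-20"]),
})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Bogotá", "Cali"]})

joined   = sales.merge(stores, on="store", how="left")      # Join / Lookup
filtered = joined[joined["amount"] > 90]                     # Filter
totals   = joined.groupby("store")["amount"].sum()           # Aggregation
joined["city_upper"] = joined["city"].str.upper()            # Text: ToUpper
joined["days_old"] = (pd.Timestamp("2021-08-23") - joined["sold_at"]).dt.days  # Date: DateDiff
ranked   = joined.sort_values("amount", ascending=False)     # Sort
```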

Page 13: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Transform

Operation and Instrumentation

Concerns: timetable (scheduling), notifications, activity monitoring, auditing, backup, security, re-execution, regulation

• Did the process finish properly?
• Is it possible to monitor the process?
• Is there enough detail in case of runtime errors?
• Is it possible to audit or resolve runtime errors?
• In case of failure, is it possible to restart the process without human intervention?
• Is it possible to secure the source code as well as the transferred and stored data?
• Is it possible to rely on the resulting system?

(A minimal logging-and-retry sketch follows below.)
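A minimal sketch of the monitoring, auditing and re-execution concerns above, using Python's standard logging; the step function is a placeholder and not part of any specific ETL tool:

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def run_step(step, retries=3, delay=5):
    """Run an ETL step, retrying automatically so no human intervention is needed."""
    for attempt in range(1, retries + 1):
        try:
            log.info("starting %s (attempt %d)", step.__name__, attempt)
            step()
            log.info("finished %s", step.__name__)           # activity monitoring
            return True
        except Exception:
            log.exception("step %s failed", step.__name__)   # detail for auditing
            time.sleep(delay)                                # wait before re-execution
    log.error("step %s exhausted its retries", step.__name__)
    return False
```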

Page 14: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Transform from the Analytics perspective

Page 15: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Analysis-oriented transforms

• Data format change (precision, data type, encoding/location)

• Location-based unit conversion (e.g. mph to km/h, currency conversion, unit standardization)

• Selection/omission of columns for the loading process (e.g. ensure that columns with null values are not fed to the warehouse or data lake)

• Adding columns (e.g., relate high-tech equipment brands to country of origin)

• Splitting one column into several (e.g. normalize the name of a person by separating given names from surnames)

• Code translation (e.g. convert the code "H" for men and "M" for women to "1" for men and "2" for women)

• Computing new calculated values (mathematical functions applied to data series in one or more columns of the source)

(See the pandas sketch below.)
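The transforms above can be sketched with pandas on invented data; the column names and the code table are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Ana Pérez", "Luis Gómez"],
    "speed_mph": ["60", "45"],
    "sex_code": ["H", "M"],
})

df["speed_mph"] = df["speed_mph"].astype(float)       # data format / type change
df["speed_kmh"] = df["speed_mph"] * 1.60934           # unit conversion (mph to km/h)
df[["first_name", "surname"]] = df["full_name"].str.split(" ", n=1, expand=True)  # split column
df["sex"] = df["sex_code"].map({"H": 1, "M": 2})      # code translation (H -> 1, M -> 2)
df = df.drop(columns=["full_name"])                   # column omission before loading
```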

Page 16: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Compound transforms

In ETL for Data Analytics (data integration), the analysis usually requires data from multiple sources.

• Lookups: comparing a piece of data with data from another source to cross-reference information (e.g., taking a client code from one database and crossing it with a database of granted loans to find out whether or not that client holds a loan).

• Pivoting: converting a normalized data set into a less normalized but more compact version by using the values of multiple rows as column labels, i.e. pivoting multiple rows into columns (e.g. turning sales records per month and store into one column per month, with a single sales value per store row). See the sketch below.
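Both compound transforms can be sketched with pandas; tables and columns are invented for illustration:

```python
import pandas as pd

# Lookup: cross client codes against a loans source to flag who holds a loan.
clients = pd.DataFrame({"client_id": [1, 2, 3]})
loans = pd.DataFrame({"client_id": [1, 3], "loan_amount": [5000, 12000]})
enriched = clients.merge(loans, on="client_id", how="left")
enriched["has_loan"] = enriched["loan_amount"].notna()

# Pivoting: one row per (store, month) becomes one column per month.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [100, 120, 80, 95],
})
pivoted = sales.pivot_table(index="store", columns="month", values="sales", aggfunc="sum")
```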

Page 17: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL tools
Selection criteria and examples

Relationship with the goal of the Analytics process

Page 18: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL tools: functional requirements

• Read and write from the full range of required data sources, whether they are located locally or in the cloud.

• Perform data transformation processes (sort, filter and aggregate).

• Provide built-in data governance and quality capabilities such as data deduplication, matching, and profiling.

• Include collaboration tools.
  • Reusing previous development elements will be easier, and the resulting data integration flows could be more efficient.
  • A single task could support multiple destinations instead of a series of data integration flows doing similar things.

https://www.talend.com/es/resources/etl-tools/

Page 19: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL tools: interoperability

• As cloud systems have become a standard, the ability to adapt to CI/CD (Continuous Integration / Continuous Delivery) processes is now a must.

• The ETL tool should be able to operate on any environment: local, cloud or hybrid infrastructures.

• An ETL tool should be able to adapt to new vendors seamlessly.
  • e.g. move a data lake between Redshift and Snowflake based on offers or requirements, or use AWS as the cloud provider this quarter but Azure the next.

• It is important to have an ETL tool able to work in a multi-cloud environment and to know how to adapt it to new providers and deployment environments (by modifying some components) while preserving the business and transformation logic.

Page 20: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL tools: integration and scalability

• An ETL tool should work well with the latest innovations and easily adapt to new technologies.

• Good ETL tools will be able to integrate with serverless technologies, Spark, Snowflake, machine learning, etc., and quickly adapt to new technologies that we do not yet know about.

• Scalability is very important when choosing tools for data analytics, given the dynamics of decision-making and strategic processes in organizations.

• Horizontal / vertical scaling

• Portability is an important property of ETL tools, but one frequently missed or avoided.
  • e.g. the Apache Hadoop ecosystem moves fast. In 2014 and 2015 the standard was MapReduce, but in late 2016 Spark became the new default solution. If you had opted for hand-coding back then, it was practically impossible to port that code from MapReduce to Spark. The main ETL tools allow these types of changes without problems.

https://www.talend.com/es/resources/etl-tools/

Page 21: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Commercial ETL software at different layers

• Hortonworks has now merged with Cloudera.

https://www.talend.com/es/resources/etl-tools/
https://streamsets.com/

Page 22: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

ETL software: examples

https://www.talend.com/blog/2018/06/25/why-data-scientists-love-python-and-how-to-use-it-with-talend/

https://streamsets.com/blog/automating-pipeline-development-with-the-streamsets-sdk-for-python/

Page 23: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco
Page 24: Ingeniería de Sistemas Data Analytics Prof: Hugo Franco

Presentation (next session)

• Presentation: select an ETL tool from:
  • https://hevodata.com/learn/best-big-data-etl-tools/
  • https://cllax.com/top-11-best-etl-tools-list-for-big-data.html
  • https://www.scrapehero.com/best-data-management-etl-tools/
  • https://www.xplenty.com/blog/top-7-etl-tools/

Describe (15 minutes per group):
• Detailed description of the tool (goal, methods, functionality, interfaces).
• Pros and cons.
• Use case example.