
Post on 04-Jul-2020


Page 1: Data Pipeline testing now made easy! · What is in store for you? Some history and introduction to Apache Falcon Falcon Unit - A new feature in v0.7 Falcon Unit - How it simplifies

Data Pipeline testing now made easy!

With Apache Falcon


$whoami

❖ Pallavi Rao
➢ Architect, InMobi
➢ Committer, Apache Falcon
➢ Contributor, Apache Pig

❖ Pavan Kumar Kolamuri
➢ Sr. Software Engineer, InMobi
➢ Contributor, Apache Falcon and Oozie


What is in store for you?

❖ Some history and introduction to Apache Falcon

❖ Falcon Unit - A new feature in v0.7

❖ Falcon Unit - How it simplifies testing pipelines

❖ Demo, Q&A


Once upon a time...


What kept us up at night?


❏ Failures

❏ Data arriving late

❏ Re-processing

❏ Varied Data Replication

❏ Varied Data Retention

❏ Data Archival

❏ Lineage

❏ SLA monitoring


The pattern



The concoction


Concoction.. distributed


Some maladies cured

Data Management
● Data Import/Export
● Retention
● Replication
● Archival

Data Governance
● Lineage
● Audit
● SLA

Process Management
● Relays
● Late Data Handling
● Failure Retries
● Reruns


Sample pipeline in Falcon

[Diagram: sample pipeline. Falcon feeds: Click Logs, Enhanced Clicks, Hourly Clicks, Daily Clicks, Metadata. Falcon processes: Click Enhancer, Hourly Aggregation, Daily Aggregation.]

Feed annotations shown in the diagram:

● Retention: 2 hours; Frequency: 5 mins; Late data arrival
● Retention: 2 days; Replication required
● Retention: 1 day
● Retention: 7 days; Replication required


Cluster Specification

<cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
    <tags>[email protected], [email protected], _department_type=forecasting</tags>
    <interfaces>
        <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
        <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
        <interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
        <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
        <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
    </interfaces>
    <locations>
        <location name="staging" path="/projects/falcon/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/projects/falcon/working"/>
    </locations>
    <properties>
        <property name="field1" value="value1"/>
    </properties>
</cluster>

How to access data

Where to execute

HCat

Falcon cache

User defined props


Feed specification

<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
    <frequency>minutes(5)</frequency>
    <late-arrival cut-off="hours(1)"/>
    <sla slaLow="hours(2)" slaHigh="hours(3)"/>
    <clusters>
        <cluster name="corp" type="source">
            <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
        </cluster>
        <cluster name="secondary" type="target">
            <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
            <locations>
                <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
            </locations>
        </cluster>
    </clusters>
    …..
</feed>

Frequency

Location

SLA Monitoring

Data Retention

Data Replication
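
The location template in the feed specification resolves to one directory per feed instance. A minimal Java sketch of that expansion follows, assuming plain string substitution of the ${YEAR}, ${MONTH}, ${DAY}, ${HOUR} and ${MINUTE} variables in UTC. The class FeedPathSketch and its methods are hypothetical illustrations, not part of Falcon or Falcon Unit.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Falcon API: shows how a feed location
// template could map instance times to directories.
final class FeedPathSketch {

    // Zero-pads a date field to the given width, independent of locale.
    private static String pad(int value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    // Substitutes the date variables for a single instance time, in UTC.
    static String expand(String template, ZonedDateTime instanceTime) {
        ZonedDateTime utc = instanceTime.withZoneSameInstant(ZoneOffset.UTC);
        return template
                .replace("${YEAR}", pad(utc.getYear(), 4))
                .replace("${MONTH}", pad(utc.getMonthValue(), 2))
                .replace("${DAY}", pad(utc.getDayOfMonth(), 2))
                .replace("${HOUR}", pad(utc.getHour(), 2))
                .replace("${MINUTE}", pad(utc.getMinute(), 2));
    }

    // Lists the directories for `count` consecutive instances, one every
    // `frequencyMinutes` minutes, starting at `start`.
    static List<String> instancePaths(String template, ZonedDateTime start,
                                      int frequencyMinutes, int count) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            paths.add(expand(template, start.plusMinutes((long) i * frequencyMinutes)));
        }
        return paths;
    }

    public static void main(String[] args) {
        String template =
                "/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}";
        ZonedDateTime start = ZonedDateTime.of(2015, 9, 28, 0, 0, 0, 0, ZoneOffset.UTC);
        // Three instances at the feed's minutes(5) frequency.
        instancePaths(template, start, 5, 3).forEach(System.out::println);
    }
}
```

A test that injects data for a feed can use the same mapping to decide which directories it expects to exist.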


Process specification

<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="corp">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>LIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
        <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
    </inputs>
    <outputs>
        <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
    </outputs>
    <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
    <retry policy="periodic" delay="hours(10)" attempts="3"/>
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
    </late-process>
</process>

Where should the process run?

How should the process run?

What to consume?

What to produce?

Processing logic

Late Data processing


Why Falcon Unit?


Before Falcon Unit

Unit Tests for each module using either PigUnit or MRUnit or JUnit.

Integration Tests executed by bringing up AWS instances with the entire stack.


Before Falcon Unit

[Diagram: a pipeline of Falcon feeds and Falcon processes. Some stages are marked OK; others are marked Spec. Invalid, Invalid Input, Invalid Output, or Improper Replication.]


Motivation for Falcon Unit

❖ User errors caught only at deploy time.

➢ Input/Output feeds and paths not getting resolved.

➢ Errors in specification.

❖ Integration Tests require environment setup/tearDown.

➢ Messy deployment scripts.

➢ Time consuming.

❖ Debugging was cumbersome.

➢ Logs scattered.


Falcon Unit

[Diagram: Test suite, Falcon Unit, and the two execution environments listed below.]

In-process execution environment
● Local Oozie
● Local File System
● Local Job Runner
● Local Message Queue

Actual cluster
● Oozie
● HDFS
● YARN
● ActiveMQ


What you can test

Data Management
● Data creation
● Data injection
● Retention
● Replication

Data Governance
● Lineage
● Data availability for verification

Process Management
● Validation of definition
● Entity scheduling and status verification
● Correctness of data window being picked up
● Reruns
● Missing dependencies/properties


After Falcon Unit

[Diagram: a pipeline of Falcon feeds and Falcon processes. Every stage is marked OK and the pipeline is marked TESTED.]


Falcon Unit Illustrated


Capabilities with example


❖ Entity creation and data flow validation.

❖ Data Injection.

❖ Data Retention and Replication.

❖ Seamless API for cluster and local mode.


Example Pipeline

[Diagram: the Daily Aggregation process consumes the Hourly Clicks feed and produces the Daily Clicks feed. A Deferred Clicks feed and further elided elements (...) are also shown.]


Cluster Creation

→ Local Mode

submit(EntityType.CLUSTER, coloName, clusterName, propsMap);

submitCluster(); // uses defaults

→ Cluster Mode

submit(EntityType.CLUSTER, <Path to Cluster XML>);


Feed Creation

Submit Feed

submit(EntityType.FEED, <Path to Hourly Clicks XML>);

Inject Data

createData("HourlyClicks", "local", scheduleTime, <test data path>, numInstances);

/projects/falcon/clicks/hourly/2015/09/28/00/<data>

…./01/<data>

…./02/<data>

…./03/<data>

…./04/<data>


Process Creation

Process Submission:

submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Local Mode

submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Cluster Mode

Process Scheduling:

scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);

Process Verification:

getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);


Data Retention

● Data retention can be validated by scheduling the feed in both cluster mode and local mode.

createData("HourlyClicks", "local", timeStamp, <test data path>);

schedule(EntityType.FEED, "HourlyClicks", "local");

status = getInstanceStatus(EntityType.FEED, "HourlyClicks");

● Falcon Unit provides APIs for validation of existence of paths.
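
What a retention test asserts can be reduced to a small rule: an instance older than the retention limit should no longer exist. A minimal Java sketch of that rule follows, assuming an instance becomes eligible for deletion once its time falls before now minus the limit. Falcon's actual eviction timing may differ, and RetentionSketch is a hypothetical helper, not a Falcon Unit API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Falcon Unit API: computes which feed
// instances a test would expect retention to have removed.
final class RetentionSketch {

    // Returns the instance times strictly older than (now - limit).
    static List<Instant> expired(List<Instant> instanceTimes, Instant now, Duration limit) {
        Instant cutoff = now.minus(limit);
        List<Instant> result = new ArrayList<>();
        for (Instant t : instanceTimes) {
            if (t.isBefore(cutoff)) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2015-09-28T00:00:00Z");
        List<Instant> instances = Arrays.asList(
                Instant.parse("2015-09-25T00:00:00Z"),
                Instant.parse("2015-09-27T12:00:00Z"));
        // With a days(2) limit, only instances before 2015-09-26T00:00Z qualify.
        System.out.println(expired(instances, now, Duration.ofDays(2)));
    }
}
```

A test could then check that the paths of the returned instances are gone and that the remaining ones still exist.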


Data Replication

Data Replication can also be tested using Falcon Unit:

submitCluster(coloName, srcCluster, propsMap);

submitCluster(coloName, targetCluster, propsMap);

createData("HourlyClicks", "srcCluster", timeStamp, <test data path>);

schedule(EntityType.FEED, "HourlyClicks", targetCluster);

status = getInstanceStatus(EntityType.FEED, feed, targetCluster);

Assert.assertEquals(status, WorkflowStatus.SUCCEEDED);


Going forward ...

❖ Improved data injection
➢ Generation of test data from template
➢ Sampling of production data for testing

❖ Support for other data lifecycle operations
➢ Data ingestion, export

❖ Maven plugin for build time validation of definitions.


Demo


Questions?


If you want to ask later - [email protected]

If you want to contribute - [email protected]