Test Driven Elephants: ETL Validation with Cascading on Elastic MapReduce

Upload: vijay-ramesh

Post on 11-May-2015

Category:

Technology


DESCRIPTION

Test Driven Elephants: ETL Validation with Cascading on Elastic MapReduce. Presentation from Data Week 2013, Vijay Ramesh, Change.org.

TRANSCRIPT

Page 1: Test Driven Elephants

Test Driven Elephants: ETL Validation with Cascading on Elastic MapReduce

Page 2: Test Driven Elephants

The problem

Difficulty with TDD on Hadoop:

· Slow feedback cycle

· End-to-end smoke tests only

· Requires Hadoop install in your CI environment

Page 3: Test Driven Elephants

Cascading’s solution

Organize map/reduce jobs into flows, connecting:

· taps: source and sink - where your data is coming from, where it is going to (e.g., on HDFS or S3)

· pipe assemblies: what types of transformations/operations to run against it, specified independently of the data sources they process

from http://docs.cascading.org/impatient/
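The split between taps and pipe assemblies can be pictured without Cascading at all. What follows is a plain-Java analogy, not Cascading's API: the class `TapsAndPipesSketch`, the methods `retainFirstColumn` and `run`, and the in-memory lists standing in for taps are all hypothetical names invented for illustration, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Plain-Java analogy only; none of these names come from Cascading.
class TapsAndPipesSketch {

    // "Pipe assembly": a transformation defined with no knowledge of where
    // rows come from or where they go. It keeps the first tab-separated
    // column of each row.
    static Function<List<String>, List<String>> retainFirstColumn() {
        return rows -> {
            List<String> out = new ArrayList<>();
            for (String row : rows) {
                out.add(row.split("\t")[0]);
            }
            return out;
        };
    }

    // "Flow": binds a source and a sink to an assembly and runs it.
    static void run(List<String> source,
                    Function<List<String>, List<String>> assembly,
                    List<String> sink) {
        sink.addAll(assembly.apply(source));
    }

    public static void main(String[] args) {
        // In-memory stand-ins for taps; a production flow would read from
        // and write to HDFS or S3 instead. The rows are made-up sample data.
        List<String> source = Arrays.asList(
            "2013-02-25 00:00:01\ta",
            "2013-02-25 00:00:08\tb"
        );
        List<String> sink = new ArrayList<>();
        run(source, retainFirstColumn(), sink);
        System.out.println(sink);
    }
}
```

Because the assembly never refers to its source or sink, the same transformation can be exercised against small in-memory data in a test and against real storage in production.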

Page 4: Test Driven Elephants

And this buys us?

Lots of stuff:

· Topological scheduling

· Reusable/parameterizable components

· Debug facilities (traps, intermediate sinks, logging, etc.)

· LocalFlowConnector platform to run in-memory flows

· and... Testable Components!

Page 5: Test Driven Elephants

An example application

Verify* extract data from a variety of production sources (MySQL, MongoDB, etc)

*meaning data creation/update patterns fall within acceptable (experimentally determined) thresholds

“Medium” data:

· 60M+ users, 200M+ signatures, 1.7B+ emails

· 100M+ rows a week for more active sources

· Scale with number of sources & traffic patterns
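As a rough illustration of what "within acceptable thresholds" could mean, here is a minimal plain-Java sketch. It is an assumption, not the talk's implementation: the class `ThresholdCheckSketch`, the method `bucketsOverThreshold`, the 600 threshold, and the reading of the values as seconds are all hypothetical. The per-bucket maximum gaps reuse the max_delta figures (37 and 3599) from the integration test later in the deck.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the threshold value and the unit are assumptions.
class ThresholdCheckSketch {

    // Returns the buckets whose largest gap between consecutive records
    // exceeds the allowed threshold, in map iteration order.
    static List<String> bucketsOverThreshold(Map<String, Long> maxDeltaByBucket,
                                             long threshold) {
        List<String> failing = new ArrayList<>();
        for (Map.Entry<String, Long> entry : maxDeltaByBucket.entrySet()) {
            if (entry.getValue() > threshold) {
                failing.add(entry.getKey());
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        Map<String, Long> maxDelta = new LinkedHashMap<>();
        maxDelta.put("2013-02-25", 37L);
        maxDelta.put("2013-02-26", 3599L);
        // 600 is an arbitrary threshold chosen only for this example.
        System.out.println(bucketsOverThreshold(maxDelta, 600L));
    }
}
```

In practice the threshold would come from observed traffic for each source, as the slide notes that thresholds are determined experimentally.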

Page 6: Test Driven Elephants

Cascading Flow

Initial source standardization:

    Pipe source = new Rename(
        new Retain(
            new Pipe(sourceName + "/continuity"),
            sourceField
        ),
        sourceField,
        new Fields("timestamp")
    );

Page 7: Test Driven Elephants

Cascading Flow

Custom Pipe Assemblies:

    @VisibleForTesting
    Pipe timestampContinuityAggregate(Pipe source) {
        return new Every(
            new GroupBy(
                source,
                new Fields("time_bucket"),
                new Fields("timestamp")
            ),
            new Fields("timestamp", "time_bucket"),
            new TimestampContinuityAggregator(),
            new Fields("time_bucket", "average_delta", "max_delta")
        );
    }

Page 8: Test Driven Elephants

Pipe assembly: timestampContinuityAggregate
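The deck does not show the body of TimestampContinuityAggregator. Judging only by the output field names average_delta and max_delta, one plausible reading is that it measures the gaps between consecutive timestamps within a time bucket. The plain-Java sketch below works under that assumption. The class `ContinuityAggregateSketch`, the method `aggregate`, and the use of epoch seconds are hypothetical, and no Cascading classes are involved.

```java
import java.util.Arrays;

// Hypothetical sketch of a per-bucket continuity aggregate; this is an
// assumption about the computation, not the talk's actual aggregator.
class ContinuityAggregateSketch {

    // Returns {averageDelta, maxDelta} for the gaps between consecutive
    // timestamps (assumed to be epoch seconds) of a single time bucket.
    // With fewer than two timestamps there are no gaps, so both are 0.
    static double[] aggregate(long[] timestamps) {
        if (timestamps.length < 2) {
            return new double[] {0.0, 0.0};
        }
        long[] sorted = timestamps.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        long sum = 0;
        long max = 0;
        for (int i = 1; i < sorted.length; i++) {
            long delta = sorted[i] - sorted[i - 1];
            sum += delta;
            max = Math.max(max, delta);
        }
        return new double[] {(double) sum / (sorted.length - 1), (double) max};
    }

    public static void main(String[] args) {
        // Made-up timestamps for one bucket.
        double[] result = aggregate(new long[] {0L, 10L, 40L});
        System.out.println("average_delta=" + result[0] + " max_delta=" + result[1]);
    }
}
```

In the Cascading version, the GroupBy groups on time_bucket with timestamp as a secondary sort field, so an aggregator would presumably receive each bucket's timestamps already ordered; the sort here only keeps the sketch self-contained.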

Page 9: Test Driven Elephants

Simple unit tests

    @Test
    public void testContinuityAggregate() throws Exception {
        Pipe continuityAggregate = controller.timestampContinuityAggregate(
            new Pipe("timestampContinuityAggregate")
        );

        Flow flow = getPlatform().getFlowConnector().connect(
            source(new Fields("timestamp", "time_bucket"), "timeBucket"),
            sink(new Fields("time_bucket", "average_delta", "max_delta"), "timestampContinuityAggregate"),
            continuityAggregate
        );

        flow.complete();

        validateLength(flow, 2, null);

        List<Tuple> values = getSinkAsList(flow);
        for (Tuple value : values) {
            Assert.assertTrue(
                value.getString(0).equals("2007-02-23") || value.getString(0).equals("2007-02-24")
            );
            Assert.assertTrue(value.getString(1).equals("2717.5925925925926"));
            Assert.assertTrue(value.getString(2).equals("19331"));
        }
    }

Page 10: Test Driven Elephants

Simple unit tests

    @Test
    public void testContinuityAggregate() throws Exception {
        Pipe continuityAggregate = controller.timestampContinuityAggregate(
            new Pipe("timestampContinuityAggregate")
        );

        Flow flow = getPlatform().getFlowConnector().connect(
            source(new Fields("timestamp", "time_bucket"), "timeBucket"),
            sink(new Fields("time_bucket", "average_delta", "max_delta"), "timestampContinuityAggregate"),
            continuityAggregate
        );

Page 11: Test Driven Elephants

Simple unit tests

        flow.complete();

        validateLength(flow, 2, null);

        List<Tuple> values = getSinkAsList(flow);
        for (Tuple value : values) {
            Assert.assertTrue(
                value.getString(0).equals("2007-02-23") || value.getString(0).equals("2007-02-24")
            );
            Assert.assertTrue(value.getString(1).equals("2717.5925925925926"));
            Assert.assertTrue(value.getString(2).equals("19331"));
        }
    }

Page 12: Test Driven Elephants

Integration testing

    package org.change.ml.extract.verification;

    import ...

    @PlatformRunner.Platform({LocalPlatform.class})
    public class ControllerIntegrationTest extends VerificationTestCase {
        ...

        @Test
        public void testVerifyTimestampContinuity() throws Exception {
            Tap signaturesSource = source(...);
            Tap continuitySink = sink(...);
            Tap verifiedSink = sink(...);

            Flow verifyContinuity = controller.verifyTimestampContinuity(
                "signatures",
                new Fields("created_at"),
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse("2013-02-25 00:00:00"),
                signaturesSource,
                continuitySink,
                verifiedSink
            );
            verifyContinuity.complete();

            List<Tuple> continuityTuples = asList(verifyContinuity, continuitySink);
            List<Tuple> verifiedTuples = asList(verifyContinuity, verifiedSink);

            Assert.assertTrue(continuityTuples.get(0).toString("\t").equals("2013-02-25\t7.166666666666667\t37"));
            Assert.assertTrue(continuityTuples.get(1).toString("\t").equals("2013-02-26\t720.6\t3599"));

            Assert.assertTrue(verifiedTuples.get(0).toString("\t").equals(
                "2013-02-25,2013-02-26\t2013-02-26\t2013-02-25,2013-02-26\t2013-02-26")
            );
        }
        ...

    }

Page 13: Test Driven Elephants

Integration testing

    package org.change.ml.extract.verification;

    import ...

    @PlatformRunner.Platform({LocalPlatform.class})
    public class ControllerIntegrationTest extends VerificationTestCase {
        ...

        @Test
        public void testVerifyTimestampContinuity() throws Exception {
            Tap signaturesSource = source(...);
            Tap continuitySink = sink(...);
            Tap verifiedSink = sink(...);

Page 14: Test Driven Elephants

Integration testing

            Flow verifyContinuity = controller.verifyTimestampContinuity(
                "signatures",
                new Fields("created_at"),
                new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse("2013-02-25 00:00:00"),
                signaturesSource,
                continuitySink,
                verifiedSink
            );

Page 15: Test Driven Elephants

Integration testing

            verifyContinuity.complete();

            List<Tuple> continuityTuples = asList(verifyContinuity, continuitySink);
            List<Tuple> verifiedTuples = asList(verifyContinuity, verifiedSink);

            Assert.assertTrue(continuityTuples.get(0).toString("\t").equals("2013-02-25\t7\t37"));
            Assert.assertTrue(continuityTuples.get(1).toString("\t").equals("2013-02-26\t720.6\t3599"));

            Assert.assertTrue(verifiedTuples.get(0).toString("\t").equals(
                "2013-02-25,2013-02-26\t2013-02-26\t2013-02-25,2013-02-26\t2013-02-26")
            );

Page 16: Test Driven Elephants

Running tests locally

Page 17: Test Driven Elephants

is awesome!

Check out https://github.com/Cascading/cascading.samples for examples using Gradle to drive Cascading

Page 18: Test Driven Elephants

Now what?

· Continuous Deployment

  · Gradle on Circle CI build to S3

· Scalable Elastic MapReduce

  · DataPipeline provisioned clusters

· Production Monitoring

  · SNS to Pager Duty integration on failures

Page 19: Test Driven Elephants

We’re hiring!

Thanks to my colleagues at Change.org for all their help with this presentation.

If you’re interested in building large-scale distributed systems for data processing and recommendations (along with lots of other cool stuff that helps empower people everywhere to create the change they want to see), we’re hiring!

Drop me an email ([email protected]) or visit our website (http://www.change.org/hiring)