Introduction to Cascading


Post on 01-Jul-2015


DESCRIPTION

Introduction to Cascading, by Bryce Lohr. Presentation delivered at the Triad Hadoop Users Group. It provides a brief introduction to Cascading, a Java library for developing scalable Map/Reduce applications on Hadoop. Bryce Lohr is a software developer at Inmar, focused on developing data analysis applications using Hadoop and related technologies. https://www.linkedin.com/pub/bryce-lohr/3/589/225

TRANSCRIPT

Page 1: Introduction to Cascading

IN-0021

Introduction to Cascading

10/16/2014

Bryce Lohr

Software Developer

Page 2: Introduction to Cascading

Copyright 2010 FRATRES ERRABUNDI blog, https://blueridgetreks.wordpress.com/page/2/

“The Cascades”, Cascade Falls, Pembroke, VA

Page 3: Introduction to Cascading

®© 2014 Inmar, Inc. All Rights Reserved.

What is Cascading?

• It’s a Java library

– Specifically for creating Hadoop applications

– High-level; abstracts away Map/Reduce paradigm

– Many pre-built data processing tools included

• Joins, aggregations, filters, data formats


Page 4: Introduction to Cascading


What can you do with Cascading?

• Optimized, production-scale operational processing

– For example, calculating how much to bill all of your hosting customers by processing all the logs from a server fleet each day

• Analytical queries

• ETL jobs

• Data preparation for machine learning

Not so great for ad-hoc queries.


Page 5: Introduction to Cascading


The Metaphor

• Think water flowing through pipes

– Flows have sources and sinks (drains)

– May converge or be split

– May pass through filters or turbines

• Elements of the abstraction

– Taps – sources, sinks

– Fields

– Pipes

– Operations

– Flows

– Cascades
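The pipe metaphor above can be pictured with a small plain-Java analogy. This is only a sketch of the idea, not Cascading code: the class name `PipeMetaphorSketch` and the methods `pipe` and `runFlow` are hypothetical, and in-memory lists stand in for the source and sink taps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;

// Plain-Java analogy for the metaphor; none of these names are Cascading classes.
final class PipeMetaphorSketch {

    // A "pipe" is modeled as a function from one stream of records to another.
    // Each step plays the role of an operation attached to the pipe.
    static Function<Stream<String>, Stream<String>> pipe() {
        Function<Stream<String>, Stream<String>> trim = s -> s.map(String::trim);
        Function<Stream<String>, Stream<String>> dropBlank =
                s -> s.filter(line -> !line.isEmpty());
        return trim.andThen(dropBlank);
    }

    // A "flow" connects a source to a sink through a pipe and runs it.
    static List<String> runFlow(List<String> source) {
        List<String> sink = new ArrayList<>();
        pipe().apply(source.stream()).forEach(sink::add);
        return sink;
    }

    public static void main(String[] args) {
        // Prints [alpha, beta]
        System.out.println(runFlow(Arrays.asList("  alpha ", "", " beta")));
    }
}
```

In this analogy, several such flows run in dependency order would correspond to a cascade.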


Page 6: Introduction to Cascading


The Metaphor

[Diagram: flow for the word count job with scrubbing and a stop word filter. From Cascading for the Impatient, Part 4: http://docs.cascading.org/impatient/impatient4.html]


Page 7: Introduction to Cascading


What does it look like?

• Simplest possible Cascading job

– Distributed file copy

– https://github.com/Cascading/Impatient/blob/master/part1/src/main/java/impatient/Main.java

• Job shown in previous diagram

– Word counting algorithm with scrubbing and stop word filter

– https://github.com/Cascading/Impatient/blob/master/part4/src/main/java/impatient/Main.java
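For readers who want the logic of the word count job without opening the linked source, here is a minimal plain-Java sketch of the same steps: tokenize, scrub, drop stop words, count. It deliberately uses only the JDK and runs in memory; it is not the Impatient code and uses no Cascading API. The class name `WordCountSketch`, the scrubbing rule (lowercase, keep only letters and digits), and the sample stop words are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// In-memory illustration of the word count steps; not Cascading code.
final class WordCountSketch {

    static Map<String, Long> count(List<String> lines, Set<String> stopWords) {
        return lines.stream()
                // Tokenize: split each line on whitespace.
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // Scrub (assumed rule): lowercase and strip anything but a-z and 0-9.
                .map(tok -> tok.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", ""))
                .filter(tok -> !tok.isEmpty())
                // Stop word filter: drop tokens found in the stop list.
                .filter(tok -> !stopWords.contains(tok))
                // Group by token and count, sorted by token for stable output.
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "in"));
        List<String> lines = Arrays.asList("The rain in Spain.", "Rain, rain!");
        // Prints {rain=3, spain=1}
        System.out.println(count(lines, stopWords));
    }
}
```

Each stream stage corresponds to one box in a pipeline of this kind; the difference in a real job is that the stages are planned onto a Hadoop cluster and the data comes from taps rather than from a list.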


Page 8: Introduction to Cascading


Cascading compared to other tools

• vs. Hadoop MapReduce API

– Much higher level of abstraction; get more done in less time

• Ex. Implementing a simple join in MapReduce

• vs. Hive

– Direct control of query optimization

• Hive optimizer still isn’t as good as typical SQL databases

– Easier to use data from a variety of sources in the same job

– Painless user defined functions

– Potentially better management & monitoring in production

• vs. Pig

– Painless user defined functions

– Potentially better management & monitoring in production
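To give a feel for the "simple join in MapReduce" example mentioned above, the sketch below walks through the bookkeeping that a hand-written reduce-side inner join typically involves: tag each record with the input it came from, group by key, then pair up the two sides per key. It is a plain-Java, in-memory simulation, not Hadoop or Cascading code; the class name `ReduceSideJoinSketch`, the `"L"`/`"R"` tags, and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of a reduce-side inner join; not Hadoop API code.
final class ReduceSideJoinSketch {

    // Each input record is a {key, value} pair. Output rows are "key,left,right".
    static List<String> join(List<String[]> left, List<String[]> right) {
        // "Map" and "shuffle": tag every record with its side and group by key.
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] rec : left) {
            groups.computeIfAbsent(rec[0], k -> new ArrayList<>())
                  .add(new String[] {"L", rec[1]});
        }
        for (String[] rec : right) {
            groups.computeIfAbsent(rec[0], k -> new ArrayList<>())
                  .add(new String[] {"R", rec[1]});
        }

        // "Reduce": per key, separate the two sides and emit their cross product.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> group : groups.entrySet()) {
            List<String> lefts = new ArrayList<>();
            List<String> rights = new ArrayList<>();
            for (String[] tagged : group.getValue()) {
                if (tagged[0].equals("L")) {
                    lefts.add(tagged[1]);
                } else {
                    rights.add(tagged[1]);
                }
            }
            for (String l : lefts) {
                for (String r : rights) {
                    out.add(group.getKey() + "," + l + "," + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> customers = new ArrayList<>();
        customers.add(new String[] {"c1", "Acme"});
        customers.add(new String[] {"c2", "Globex"});
        List<String[]> orders = new ArrayList<>();
        orders.add(new String[] {"c1", "10"});
        orders.add(new String[] {"c1", "5"});
        orders.add(new String[] {"c3", "7"});
        // Prints [c1,Acme,10, c1,Acme,5]
        System.out.println(join(customers, orders));
    }
}
```

With the raw MapReduce API this tagging and pairing logic would be spread across mapper and reducer classes plus job setup, which is the kind of work the slides say Cascading's pre-built joins take off your hands.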


Page 9: Introduction to Cascading


Going forward

• Scalding/Cascalog

– Even higher-level DSLs on top of Cascading, using Scala and Clojure, respectively

– Keep all the robustness and flexibility of the Java platform, often with more brevity than Hive and Pig scripts

• Next-generation processing engines

– Beyond Map/Reduce: Tez, Spark, Storm

• Efficient batch, in-memory, or stream processing with Cascading

– Cascading 3.0 will support Tez; Spark and Storm will be supported later

• 3.0-WIP is available to try out now!


Page 10: Introduction to Cascading


Questions & Answers
