Introduction to Cascading
Introduction to Cascading, by Bryce Lohr. Presentation on Cascading delivered at the Triad Hadoop Users Group. This presentation provides a brief introduction to Cascading, a Java library for developing scalable Map/Reduce applications on Hadoop. Bryce Lohr is a software developer at Inmar, focused on developing data analysis applications using Hadoop and related technologies.
https://www.linkedin.com/pub/bryce-lohr/3/589/225
IN-0021
Introduction to Cascading
10/16/2014
Bryce Lohr
Software Developer
Copyright 2010 FRATRES ERRABUNDI blog, https://blueridgetreks.wordpress.com/page/2/
“The Cascades”, Cascade Falls, Pembroke, VA
®© 2014 Inmar, Inc. All Rights Reserved.
What is Cascading?
• It’s a Java library
– Specifically for creating Hadoop applications
– High-level; abstracts away Map/Reduce paradigm
– Many pre-built data processing tools included
• Joins, aggregations, filters, data formats
What can you do with Cascading?
• Optimized, production-scale operational processing
– For example, calculating how much to bill all of your hosting customers by processing all the logs from a server fleet each day
• Analytical queries
• ETL jobs
• Data preparation for machine learning
Not so great for ad-hoc queries.
The Metaphor
• Think water flowing through pipes
– Flows have sources and sinks (drains)
– May converge or be split
– May pass through filters or turbines
• Elements of the abstraction
– Taps – sources, sinks
– Fields
– Pipes
– Operations
– Flows
– Cascades
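The elements above can be modeled in a few lines of plain Java. This is only a sketch of the metaphor, not the Cascading API: the class `PipeMetaphor` and its methods `filter`, `each`, and `run` are hypothetical names invented for illustration, and an in-memory list stands in for the source and sink taps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for the metaphor; none of these names come from Cascading.
class PipeMetaphor {
    // A "pipe" is an ordered chain of operations applied to the records flowing through it.
    private final List<Function<Stream<String>, Stream<String>>> operations = new ArrayList<>();

    // A filter operation drops records that do not match the predicate.
    PipeMetaphor filter(Predicate<String> keep) {
        operations.add(s -> s.filter(keep));
        return this;
    }

    // A function operation transforms each record (the "turbine").
    PipeMetaphor each(Function<String, String> fn) {
        operations.add(s -> s.map(fn));
        return this;
    }

    // A "flow" connects a source to a sink through the pipe and runs it.
    // Here the source is an in-memory list and the sink is the returned list.
    List<String> run(List<String> source) {
        Stream<String> stream = source.stream();
        for (Function<Stream<String>, Stream<String>> op : operations) {
            stream = op.apply(stream);
        }
        return stream.collect(Collectors.toList());
    }
}
```

For example, `new PipeMetaphor().filter(r -> !r.isEmpty()).each(String::toUpperCase).run(lines)` drops empty records and upper-cases the rest. In Cascading itself, the source and sink would be taps over files or other storage, and the flow would be planned into Map/Reduce jobs rather than run in memory.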
The Metaphor
Diagram from Cascading for the Impatient, Part 4: http://docs.cascading.org/impatient/impatient4.html
What does it look like?
• Simplest possible Cascading job
– Distributed file copy
– https://github.com/Cascading/Impatient/blob/master/part1/src/main/java/impatient/Main.java
• Job shown in previous diagram
– Word counting algorithm with scrubbing and stop word filter
– https://github.com/Cascading/Impatient/blob/master/part4/src/main/java/impatient/Main.java
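To give a feel for what the word-counting job computes, here is a minimal sketch of the same logic in plain Java, without Hadoop or Cascading. It is not the code from the linked repository: the class `WordCountSketch`, the tokenizing regular expression, and the choice of trimming plus lower-casing as the "scrub" step are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical single-machine sketch; the real job runs as a Cascading flow on Hadoop.
class WordCountSketch {
    static Map<String, Long> count(List<String> lines, Set<String> stopWords) {
        return lines.stream()
            // Tokenize: split each line on runs of non-letter, non-digit characters (assumed rule).
            .flatMap(line -> Arrays.stream(line.split("[^\\p{L}\\p{N}]+")))
            // Scrub: normalize each token by trimming and lower-casing it.
            .map(token -> token.trim().toLowerCase(Locale.ROOT))
            // Drop the empty tokens that splitting can produce.
            .filter(token -> !token.isEmpty())
            // Stop word filter: discard tokens found in the stop list.
            .filter(token -> !stopWords.contains(token))
            // Group identical tokens and count them; TreeMap keeps the keys sorted.
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }
}
```

Each stage of the stream (tokenize, scrub, filter, group and count) plays the role of one operation on a pipe. Written against Cascading, the same stages would be planned into Map/Reduce jobs, so the logic scales beyond what fits in one process.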
Cascading compared to other tools
• vs. Hadoop MapReduce API
– Much higher level of abstraction; get more done in less time
• Ex. Implementing a simple join in MapReduce
• vs. Hive
– Direct control of query optimization
• Hive’s optimizer still isn’t as good as those of typical SQL databases
– Easier to use data from a variety of sources in the same job
– Painless user defined functions
– Potentially better management & monitoring in production
• vs. Pig
– Painless user defined functions
– Potentially better management & monitoring in production
Going forward
• Scalding/Cascalog
– Even higher-level DSLs on top of Cascading, using Scala and Clojure, respectively
– Enables all the robustness and flexibility of the Java platform, often with more brevity than Hive and Pig scripts
• Next-generation processing engines
– Beyond Map/Reduce: Tez, Spark, Storm
• Efficient batch, in-memory, or stream processing with Cascading
– Cascading 3.0 will support Tez; Spark and Storm will be supported later
• 3.0-WIP is available to try out now!
Questions & Answers