Intro to Cascading
Posted on 02-Jul-2015
Cascading
or, "was it worth three days out of the office?"
Agenda
What is Cascading?
Building cascades and flows
How does this fit our needs?
Advantages/disadvantages
Q&A
What is Cascading anyway?
Cascading 101
JVM framework and SDK for creating abstracted data flows
Translates data flows into actual Hadoop/RDBMS/local jobs
Huh? Okay, let's back up a bit.
Data flows
Think of an ETL: Extract-Transform-Load
In simple terms, take data from a source, change it somehow, and stick the result into something (a “sink”)
Data source → Extract → Transformation(s) → Load → Data sink
(the data flow implementation)
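That extract → transform → load shape can be sketched in plain Java. Everything here is hypothetical scaffolding (the class name, the uppercase "transform"); it just shows the three stages a real flow would fill in with game logs, Hive, Couchbase, etc.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MiniEtl {
    // Extract: in a real flow this reads from a source (game logs, Hive, ...).
    static List<String> extract(List<String> sourceLines) {
        return sourceLines;
    }

    // Transform: any per-record change; uppercasing stands in for real logic.
    static List<String> transform(List<String> records) {
        return records.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // Load: in a real flow this writes to a sink (HDFS, MySQL, Couchbase, ...).
    static List<String> load(List<String> records) {
        return records;
    }

    // The whole "flow": source → transform → sink.
    public static List<String> run(List<String> source) {
        return load(transform(extract(source)));
    }
}
```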
Pretty much everything we do is some flavor of this
Sources: Games, Hadoop, Hive/MySQL, Couchbase, web service
Transformations: Aggregations, group-bys, combined fields, filtering, etc.
Sinks: Hadoop, Hive/MySQL, Couchbase
Cascading 101 (Part Deux)
JVM data flow framework
Models data flows as abstractions:
Separates details of where and how we get data from what we do with it
Implements transform operations as SQL or MapReduce or whatever
In other words…
An ETL framework.
A Pentaho we can program.
Building cascades and flows
Cascading terminology
Flow: A path for data with some number of inputs, some operations, and some outputs
Cascade: A series of connected flows
More terminology
Operation: A function applied to data, yielding new data
Pipe: Moves data from someplace to some other place
Tap: Feeds data from outside the flow into it and writes data from inside the flow out of it
Simplest possible flow

// create the source tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// create the sink tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// specify a pipe to connect the taps
Pipe copyPipe = new Pipe("copy");

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource(copyPipe, inTap)
    .addTailSink(copyPipe, outTap);

// run the flow
flowConnector.connect(flowDef).complete();
We already have that.
It’s called ‘cp’.
Actually…
Runs entirely in the cluster
Works fine on megabytes, gigabytes, terabytes or petabytes; i.e., IT SCALES
Completely testable outside of the cluster
Who gets shell access to a namenode to run the bash or python equivalent?
Reliability is ESSENTIAL
if we, and our system, are to be taken srsly.
Reliability is a feature, not a goal.
Let’s do something more interesting.
Real world use case: Word counting
Read a simple file format
Count the occurrence of every word in the file
Output a list of all words and their counts
doc_id  text
doc01   A rain shadow is a dry area on the lee back side
doc02   This sinking, dry air produces a rain shadow, or
doc03   A rain shadow is an area of dry land that lies on
doc04   This is known as the rain shadow effect and is the
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
Newline-delimited entries
ID and text fields, separated by tabs
Plan: Split lines into words and count them over each line
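Before looking at the Cascading version, here's what that plan computes, sketched in plain Java. The class name is invented; the tab-separated `doc_id<TAB>text` format and the token-splitting delimiter class mirror the example in the following slides.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PlainWordCount {
    // Count tokens in the "text" column of tab-separated "doc_id<TAB>text" lines.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // drop the doc_id field: everything after the first tab is the text
            String text = line.substring(line.indexOf('\t') + 1);
            // same delimiter class the Cascading example uses for tokenizing
            for (String token : text.split("[ \\[\\]\\(\\),.]")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```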
Flow I/O
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);
No surprises here:
docTap reads a file from HDFS
wcTap will write the results to a different HDFS file
File parsing

Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
Fields are names for the tuple elements
RegexSplitGenerator applies the regex to input and yields matches under the “token” field
docPipe emits each "token" the splitter generates
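As a quick sanity check on what that delimiter class does (plain Java, hypothetical class name): `String.split` over the same character class yields the words, plus empty strings wherever two delimiters sit next to each other, such as ", " — which is why real flows often add a scrub/filter step.

```java
public class SplitDemo {
    // Split on the same delimiter class the RegexSplitGenerator is given.
    public static String[] split(String text) {
        return text.split("[ \\[\\]\\(\\),.]");
    }

    public static void main(String[] args) {
        // ", " produces an empty token between the comma and the space
        for (String t : split("A rain shadow, or")) {
            System.out.println("[" + t + "]");
        }
    }
}
```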
Count the tokens (words)

Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);
wcPipe connects to docPipe, using it for input
Fit a GroupBy function onto wcPipe, grouping by the token field (the actual words)
For each group in wcPipe (each distinct word), count its occurrences and output the result
Create and run the flow

FlowDef flowDef = FlowDef.flowDef()
    .setName("wc")
    .addSource(docPipe, docTap)
    .addTailSink(wcPipe, wcTap);
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.complete();
Define a new flow with name “wc”
Feed the docTap (the original text file) into the docPipe
Feed the output of wcPipe (the word counts) into the wcTap
Connect to the flowConnector (Hadoop) and go!
Cascading flow
100% Java
Databases and processing are behind class abstractions
Automatically scalable
Easily testable
How could this help us?
Testing
Create flows entirely in code on a local machine
Write tests for controlled sample data sets
Run tests as regular old Java without needing access to actual Hadoopery or databases
Local machine and CI testing are easy!
Reusability
Pipe assemblies are designed for reuse
Once created and tested, use them in other flows
Write logic to do something only once
This is *essential* for data integrity as well as good programming
Common code base
Infrastructure writes MR-type jobs in Cascading, warehouse writes data manipulations in Cascading
Everybody uses the same terms and same tech
Teams understand each other’s code
Can be modified by anyone, not just tool experts
Simpler stack
Cascading creates DAG of dependent jobs for us
Removes most of the need for Oozie (ew)
Keeps track of where a flow fails and can rerun from that point on failure
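To make "DAG of dependent jobs" concrete: running jobs so that every dependency finishes first is just a topological ordering of the graph. A toy sketch in plain Java (job names and the dependency map are invented; Cascading derives this from how flows are wired together):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class JobDag {
    // Kahn-style ordering: repeatedly pick any job whose dependencies are all
    // done. (No cycle detection -- a real scheduler would need it.)
    static List<String> runOrder(Map<String, List<String>> deps) {
        List<String> done = new ArrayList<>();
        while (done.size() < deps.size()) {
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    done.add(e.getKey());
                }
            }
        }
        return done;
    }
}
```

Rerunning from a failure point then amounts to replaying this order starting at the first job not yet marked done.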
Disadvantages
"silver bullets are not a thing"
Some bad news
JVM, which means Java (or Scala (or CLOJURE :) :)
Argument: Java is the platform for big data, so we can’t avoid embracing it.
PyCascading uses Jython, which kinda sucks
Some other bad news
Doesn’t have job scheduler
Can figure out the dependency graph for jobs, but there's nothing to run them on a regular interval
We still need Jenkins or quartz
Concurrent is doing proprietary products (read: $) for this kind of thing, but they’re months away
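Absent a built-in scheduler, something else has to kick flows off on an interval: Jenkins, Quartz, or even the plain JDK. A minimal sketch with `ScheduledExecutorService` (the class name is made up, and the `Runnable` is a stand-in for whatever actually runs the flow):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FlowScheduler {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    // Run a flow-launching task now and then at a fixed interval; in real use
    // the Runnable would connect a FlowDef and call complete() on the flow.
    public void scheduleFlow(Runnable runFlow, long periodSeconds) {
        scheduler.scheduleAtFixedRate(runFlow, 0, periodSeconds, TimeUnit.SECONDS);
    }

    public void shutdown() {
        scheduler.shutdown();
    }
}
```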
Other bad news
No real built-in monitoring
Easy to have a flow report what it has done; hard to watch it in progress
We’d have to roll our own (but we’d have to do that anyway, so whatevs)
Recommendations
"Enough already!"
Yes, we should try it.
It’s not everything we need, but it’s a lot
Possibly replace MapReduce and Sqoop
Proven tech; this isn’t bleeding edge work
We need an ETL framework and we don’t have time to write one from scratch.
Let’s prototype a couple of jobs and see what people other than me think.
Questions?
Satisfactory answers not guaranteed.