
In a previous blog, I shared a learning adventure where I wrote a news feed service in Clojure and tested it, under load, on AWS. One of the biggest impressions that Clojure made on me was how its collection-oriented functions like union, intersection, difference, map, and reduce made it well suited for data processing. I wondered how well this data processing suitability would translate to a distributed computing environment.

Recently, I attended OSCON 2014, which is a tech conference that focuses on open source software. One of the tracks was on Functional Programming. Two of the presentations that I attended talked about big data and mentioned open source projects that integrate Clojure with Hadoop. This blog explores functional programming and Hadoop by evaluating those two open source projects.

http://clojure.org/

http://glennengstrand.info/software/architecture/oss/clojure

Functional Programming and Big Data by Glenn Engstrand (September 2014)

What is Functional Programming? It is a style of programming that emphasizes immutable state, higher order functions, the processing of data as streams, lambda expressions, and concurrency through Software Transactional Memory.

What is Clojure? That is a Functional Programming language similar to Lisp that runs in the Java Virtual Machine. Additional features include: dynamic programming, lazy sequences, persistent data structures, multimethod dispatch, deep Java interoperation, atoms, agents, and references.
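A minimal Clojure sketch of that collection-oriented style, using only core and clojure.set functions; the example sets are mine:

(require '[clojure.set :as set])

;; two immutable sets; every operation below returns a new value
(def editors #{"alice" "bob" "carol"})
(def authors #{"bob" "dave"})

(set/union editors authors)        ;; => #{"alice" "bob" "carol" "dave"}
(set/intersection editors authors) ;; => #{"bob"}
(set/difference editors authors)   ;; => #{"alice" "carol"}

;; higher order functions compose over any sequence
(reduce + (map count ["news" "feed" "service"])) ;; => 15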

What is Big Data? Systems whose data flows at a high volume and a fast velocity, comes in a wide variety, and whose veracity is significant are characterized as Big Data. Usually this translates to petabytes per day. Big data must be processed in a distributed manner. Big data clusters usually scale horizontally on commodity hardware. The Apache Hadoop project is the most popular enabling technology. A typical Big Data project is built with either Cloudera or Hortonworks offerings. Usually OLAP serves as the front end where the processed data is analyzed.

http://glennengstrand.info/analytics/fp

In one of those presentations, entitled Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries, two engineers at Netflix talked about a lot of open source projects that they use for their big data pipeline. One of those projects was called PigPen.

For anyone with experience with report writers or SQL, PigPen will look pretty easy to understand. Commands like load-tsv, map, filter, group-by, and store seem pretty intuitive at first glance.
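Here is a sketch of the per-minute post count job, described later in this article, written against PigPen's documented API; the comma delimiter argument to load-tsv and the file names are my assumptions:

(require '[pigpen.core :as pig]
         '[clojure.string :as str])

;; count the "post" actions per minute from a comma delimited file whose
;; columns are year, month, day, hour, minute, entity, action, duration
(defn post-counts [input output]
  (->> (pig/load-tsv input ",")
       (pig/filter (fn [row] (= (nth row 6) "post")))
       (pig/map (fn [row] (str/join "," (take 5 row))))
       (pig/group-by identity)
       (pig/map (fn [[ts rows]] [ts (count rows)]))
       (pig/store-tsv output)))

;; running this emits the Pig script that will later load your uberjar
(pig/write-script "post-counts.pig"
                  (post-counts "perfpostgres.csv" "perMinutePostActionCounts"))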

From an architectural perspective, PigPen is an evil hack. Your Clojure code gets interpreted as a User Defined Function in a Pig script that loads and iterates through the specified input file. First, you have to run your code to generate the script. Then, you have to remove the part of the project specification that identifies the main procedure, compile your code to create an uberjar, and copy it to a place where that Pig script can load it.

That, and the fact that this was the slowest running and worst documented project that I evaluated, makes PigPen pretty much a non-starter as far as I am concerned.

http://www.oscon.com/oscon2014/public/schedule/detail/34159

https://github.com/Netflix/PigPen/

Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries


Cascalog was one of the recommended open source projects that integrates Clojure with Hadoop. Cascalog is built on top of Cascading, which is another open source project for managing ETL within Hadoop.

Cascalog takes advantage of Clojure's homoiconicity: you declare your query, to be run later in a multi-pass process. Many things, such as grouping, are not explicitly specified. The system just figures it out. The order of statements doesn't matter either.
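As a sketch of that declarative style, against Cascalog's query API as I understand it; the helper functions and file names are mine:

(require '[cascalog.api :refer :all]
         '[cascalog.ops :as c]
         '[clojure.string :as str])

;; vanilla Clojure functions; Cascalog uses the predicate as a filter
;; and the other function as a map operation
(defn post-line? [line] (= "post" (nth (str/split line #",") 6)))
(defn line->ts [line] (str/join "," (take 5 (str/split line #","))))

;; declaring the query runs nothing yet; grouping by ?ts is implied by
;; the output fields, and these clauses could appear in any order
(def post-counts
  (<- [?ts ?count]
      ((hfs-textline "perfpostgres.csv") ?line)
      (post-line? ?line)
      (line->ts ?line :> ?ts)
      (c/count ?count)))

;; ?- plans the multi-pass flow and executes it
(?- (stdout) post-counts)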

At first glance, Cascalog looks like it is idiomatic Clojure but it really isn't. It is declarative instead of imperative. It looks to be pure Functional Programming but, under the covers, it isn't. More on this in a future blog.

http://www.oscon.com/oscon2014/public/schedule/detail/34913

http://www.cascading.org/

http://cascalog.org/


Data Workflows for Machine Learning

Apache Spark is a relatively new project that is gaining a lot of traction in the Big Data world. It can integrate with Hadoop but can also stand alone. The only Functional Programming language supported by Spark is Scala.

Spark has three main flavors: map reduce, streaming, and machine learning. What I am covering here is the map reduce functionality. I used the Spark shell, which is more about learning and interactively exploring locally stored data.

Spark was, by far, the fastest technology and took the fewest lines of code to run the same query. Like PigPen, map reduce Spark was pretty limited on reduce side aggregation. All I could do was maintain counters in order to generate histograms. Cascalog had much more functional and capable reduce side functionality.

If you are coming to Scala from Java or just have little FP experience, then what you might notice the most about Scala is lambda expressions, streams, and for comprehensions. If you are already familiar with Clojure, then what you might notice the most about Scala is strong typing.

http://www.scala-lang.org/

https://spark.apache.org/


What is reduce side aggregation? There are two parts to the map reduce approach to distributed computing, map side processing and reduce side processing. The map side is concerned with taking the input data in its raw format and mapping that data to one or more key value pairs. The infrastructure performs the grouping functionality where the values are grouped together by key. The reduce side then does whatever processing it needs to do on that per key grouping of values.

Why is map reduce Spark more limited in its reduce side aggregation capability than Cascalog? In the map reduce flavor of Apache Spark, the reducer function takes two values and is expected to return one value which is the reduced aggregation of the two input values. The infrastructure then keeps calling your reducer function until only one value is left (or one value per key, depending on which reduce method you use). The reducer function in Cascalog gets called only once for each key, but the input parameter is a sequence of tuples, each tuple representing the mapped data that is a value associated with that key.
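The difference between the two contracts is easy to see in plain Clojure; this is an illustration, not Spark or Cascalog code:

;; mapped output: a sequence of key value pairs
(def mapped [["a" 1] ["b" 1] ["a" 1] ["a" 1]])

;; the infrastructure groups the values by key
(def grouped
  (reduce (fn [acc [k v]] (update-in acc [k] (fnil conj []) v)) {} mapped))
;; => {"a" [1 1 1], "b" [1]}

;; Spark-style contract: a binary function folded pairwise per key
(defn pairwise-reduce [f grouped]
  (into {} (for [[k vs] grouped] [k (reduce f vs)])))
(pairwise-reduce + grouped) ;; => {"a" 3, "b" 1}

;; Cascalog-style contract: the aggregator sees the whole per key
;; sequence at once, so any function of a sequence will do
(defn sequence-aggregate [f grouped]
  (into {} (for [[k vs] grouped] [k (f vs)])))
(sequence-aggregate count grouped) ;; => {"a" 3, "b" 1}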

I wrote the same job, generating the same report, in all three technologies, but the Spark and Scala version was the fastest. Is that a fair comparison? You have to run both the PigPen job and the Cascalog job in Hadoop. For my evaluation, I was using single node mode. Spark is different. You don't run it inside of Hadoop, and I am pretty sure that it isn't doing much or anything with Hadoop locally.

Large Spark map reduce jobs would have to integrate with Hadoop, so this may not be a fair comparison.

But the real test with any technology isn't how you can use it but rather how you can extend its usefulness. For Cascalog, that comes in the form of custom reduce side functions. In this area, PigPen was completely broken.

In the next blog, I will review my findings from reproducing the News Feed Performance map reduce job, used to load the Clojure news feed performance data into OLAP, with a custom function written in Cascalog.

In the next blog, I will also cover how Spark streaming can be made to reproduce the News Feed Performance map reduce job in real time.

In this article, I explored Functional Programming in Big Data by evaluating three different FP technologies for Big Data. I did this by writing the same Big Data report job in those technologies then comparing and contrasting my results. Hive is still the most popular way to aggregate Big Data in Hadoop. What would that job, used to evaluate those technologies, have looked like in Hive? First, you would have had to load the raw text data into HDFS.

bin/hadoop dfs -put perfpostgres.csv /user/hive/warehouse/perfpostgres

Then you would have had to define that raw text data to Hive.

create external table news_feed_perf_raw (year string, month string, day string, hour string, minute string, entity string, action string, duration string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location '/user/hive/warehouse/perfpostgres';

Here is the map part of the processing.

insert overwrite local directory 'postActions' select concat(year, ',', month, ',', day, ',', hour, ',', minute) as ts from news_feed_perf_raw where action = 'post';

Here is the reduce part of the processing.

insert overwrite local directory 'perMinutePostActionCounts' select ts, count(1) from postActions group by ts;


http://glennengstrand.info/analytics/oss

