Multidimensional Aggregations using Spark and DataFrames
Multidimensional Aggregations using Spark and DataFrames
2015-11-10
Romi Kuntsman, Senior Big Data Engineer
About me
• Leading adoption of Apache Spark in Totango
• Working with Spark for 1.5 years, since version 1.0
• Passionate about actionable big data analytics
• Working with web scale and cloud since 2008
• Previously: Outbrain, Foresight, RockeTier, Mamram
• B.Sc. in Bioinformatics from the Open University
• LinkedIn: https://il.linkedin.com/in/romik
• email: [email protected]
Agenda
• Totango Data Flow Overview
• Apache Spark DataFrames Introduction
• Merging Multiple Results Efficiently
• Open issues and questions
Data Flow Overview
"Numbers have an important story to tell. They rely on you to give them a voice."
– Stephen Few
Let's talk about aggregations
You've all done this...
SELECT module, count(*)
FROM activities
GROUP BY module
Aggregations with big data
You've probably done or seen this before as well...
Life isn't so simple
Multiple levels of calculations
Different points of view
• First level aggregations (across the last 7, 14, 30 days, etc.):
 – Counts (per account, activity, module, user, etc.)
 – Distinct counts (unique users in a module, etc.)
 – Sessions (multiple activities grouped by time proximity; see the sketch after this list)
 – Activity days (how many days had any activity)
• Higher level analytics:
 – Engagement Score (overall activity compared to others)
 – Change Metrics (how activity changes over time)
 – Account Health (good, average, or poor)
• And more...
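The "sessions" aggregation is the only one of these not shown on a later slide, so here is a minimal sketch of one way to compute it with the Java RDD API: sort each user's event timestamps and start a new session whenever the gap between consecutive events exceeds a threshold. The Event class, its fields, and the 30-minute gap are assumptions for illustration, not Totango's actual implementation.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class Sessionize {
    // hypothetical event record
    public static class Event implements Serializable {
        public String userId;
        public long timestampMs;
    }

    static final long SESSION_GAP_MS = 30 * 60 * 1000L; // assumed 30-minute gap

    // sessions per user: count a new session whenever consecutive
    // events are further apart than SESSION_GAP_MS
    public static JavaPairRDD<String, Integer> sessionCounts(JavaRDD<Event> events) {
        return events
                .mapToPair(e -> new Tuple2<>(e.userId, e.timestampMs))
                .groupByKey()
                .mapValues(timestamps -> {
                    List<Long> sorted = new ArrayList<>();
                    timestamps.forEach(sorted::add);
                    Collections.sort(sorted);
                    int sessions = sorted.isEmpty() ? 0 : 1;
                    for (int i = 1; i < sorted.size(); i++) {
                        if (sorted.get(i) - sorted.get(i - 1) > SESSION_GAP_MS) {
                            sessions++; // large time gap starts a new session
                        }
                    }
                    return sessions;
                });
    }
}
```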
What do we need?
• Easy way to develop new aggregations
• No boilerplate code, just business logic
• Scalable and distributed
• Accurate results (often underestimated)
• Fast (short batch, but not realtime in this case)
• Idempotent (same results on every run on the same input)
• Multi-tenant (same computations on isolated datasets)
Spark DataFrames
"Simple things should be simple, complex things should be possible."
– Alan Kay
Spark DataFrames
• Table-like abstraction on top of Big Data
• Able to scale from kilobytes to petabytes, from a single node to a cluster
• Transformations available in code or SQL (see the sketch below)
• User defined functions can add columns
• Actively developed optimizer
• Spark 1.3 (March 2015) - initially released
• Spark 1.4 (June 2015) - mature and usable
• Spark 1.5 (September 2015) - performance optimized
• Spark 1.6 (not yet released) - more optimizations
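A minimal sketch of what "transformations in code or SQL" looks like with the Spark 1.x Java API; the input path and column name are hypothetical:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DataFrameIntro {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("df-intro"));
        SQLContext sqlContext = new SQLContext(sc);

        // table-like abstraction over semi-structured data
        DataFrame events = sqlContext.read().json("hdfs:///path/to/events.json");

        // the same aggregation as a code transformation...
        events.groupBy("module").count().show();

        // ...and as SQL
        events.registerTempTable("events");
        sqlContext.sql("SELECT module, count(*) FROM events GROUP BY module").show();
    }
}
```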
Look ma, no map reduce!
• module counts:
 – events.groupBy("module").count()
• module unique users:
 – events.groupBy("module", "user").count().groupBy("module").count()
User defined function
• activity days:
 – sqlContext.udf().register("date_to_days", new DateToDays(), DataTypes.IntegerType)
 – eventsWithDate = sqlContext.sql("SELECT *, date_to_days(date) AS day FROM events")
 – eventsWithDate.groupBy("module", "day").count()
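For reference, a runnable version of the above with the 1.x Java API. It assumes the date column holds epoch milliseconds and that the UDF returns an integer day number; the real DateToDays logic may differ:

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class ActivityDays {
    public static DataFrame activityDays(SQLContext sqlContext, DataFrame events) {
        // register the UDF so SQL queries can call it; assumes epoch-millis dates
        sqlContext.udf().register("date_to_days",
                (UDF1<Long, Integer>) millis -> (int) (millis / (24L * 60 * 60 * 1000)),
                DataTypes.IntegerType);

        events.registerTempTable("events");
        DataFrame withDay = sqlContext.sql(
                "SELECT *, date_to_days(date) AS day FROM events");

        // activity per module per day
        return withDay.groupBy("module", "day").count();
    }
}
```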
RDDs interoperate with DataFrames
Note: sometimes we do need to go from a DataFrame to the Java RDD API (e.g. for a custom map or groupBy) and back to accomplish some things:

JavaRDD<FooBar> myRdd = dataframe.toJavaRDD().map(...);
DataFrame newDataFrame = sqlContext.createDataFrame(myRdd, FooBar.class);
DataFrames vs. RDDs

• Advantages: speed, ease of development
• Disadvantages: less flexible, limited aggregations, strict simple schema
• When going from DF to RDD: toJavaRDD forces computation, and you lose the Catalyst optimizer in the transition
• Future: this round trip may be replaced by UDAFs (user defined aggregate functions) in upcoming Spark releases
Merge Multiple Results
Merge the results
We've calculated aggregations across various dimensions. Now it's time to collect them grouped by entity (account, user, etc).
Partitioning scheme
• RDD<Value> - not partitioned by key (there is no key...)
 → Union of many RDD results will shuffle everything
• DataFrames are not partitioned by column (to be fixed...)
 → Union of many DF results will shuffle everything
• PairRDD<Key,Value> with partitionBy(partitioner) is partitioned
 → Union of many PairRDDs which used the same partitioner will be partitioned together (see the sketch below)!
• Partitioner interface (the default HashPartitioner fits most cases):
 – int getPartition(key)
 – int numPartitions
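A sketch of that trick in the Java API: pre-partition every per-dimension result with the same partitioner, so the union (and a subsequent per-key merge) stays on the same partitions instead of reshuffling. The key type, value type, partition count, and merge function are assumptions:

```java
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

public class MergeResults {
    public static JavaPairRDD<String, Long> merge(JavaPairRDD<String, Long> moduleCounts,
                                                  JavaPairRDD<String, Long> activityDays) {
        // one partitioner, shared by every result RDD
        HashPartitioner partitioner = new HashPartitioner(64);

        JavaPairRDD<String, Long> a = moduleCounts.partitionBy(partitioner);
        JavaPairRDD<String, Long> b = activityDays.partitionBy(partitioner);

        // both sides share the partitioner, so the union preserves it and
        // reduceByKey can run without a full shuffle
        return a.union(b).reduceByKey(Long::sum);
    }
}
```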
Number of partitions
• Processing always happens in chunks that must fit into one executor's memory
• Too few partitions - some chunks may not fit and you get an OOM
• Too many partitions - many small steps and overall long time
• In a multi-tenant environment, you have to find a formula based on input size that works for everyone, from the smallest tenant to the largest
• When re-partitioning, take note of data being reshuffled
• No magic formula for the optimal number of partitions :-(
Name your stages
• Stages can be named with sparkContext.setCallSite (per thread)
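A minimal sketch; setCallSite labels every job submitted from the current thread until clearCallSite is called, and the stage name here is just an example:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;

public class NamedStages {
    public static void run(JavaSparkContext sc, DataFrame events) {
        sc.setCallSite("module counts");           // name shown in the Spark UI
        events.groupBy("module").count().count();  // the action triggers a named job
        sc.clearCallSite();
    }
}
```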
To cache or not to cache
• With RDDs, you cache at every intersection
• With DataFrames, it's best to cache the input and then let the optimizer plan
• Cache when dividing the input into subsections (like time slices)
• To cache a DataFrame, you need to trigger computation; otherwise only the LogicalPlan is cached and the optimizer decides what to do (for example, when we cache a time-based data subset) - see the sketch below
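A sketch of forcing a cached DataFrame to materialize, e.g. for a time-slice subset; the filter expression and column name are hypothetical. persist() alone only marks the plan, so an action such as count() must run before the cache is actually populated:

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.storage.StorageLevel;

public class CacheSlice {
    public static DataFrame cacheTimeSlice(DataFrame events, String fromDate) {
        DataFrame slice = events.filter("date >= '" + fromDate + "'");
        slice.persist(StorageLevel.MEMORY_AND_DISK());
        slice.count(); // action forces computation so the cache is populated
        return slice;
    }
}
```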
More Spark gotchas...
• When loading from Parquet, you can't partition by column hash, only by column value
• Use Kryo for serialization (and register all classes) - see the configuration sketch below
• Use the standalone shuffle service to avoid losing shuffles when a worker crashes (like in an OutOfMemory)
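A configuration sketch for the last two points; the registered class is a hypothetical placeholder for whatever actually crosses the wire:

```java
import org.apache.spark.SparkConf;

public class SparkSetup {
    // hypothetical data class that gets shuffled between executors
    public static class Event implements java.io.Serializable {}

    public static SparkConf conf() {
        SparkConf conf = new SparkConf()
                .setAppName("aggregations")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.kryo.registrationRequired", "true") // fail fast on unregistered classes
                .set("spark.shuffle.service.enabled", "true");  // keep shuffle files available if a worker dies
        conf.registerKryoClasses(new Class<?>[]{ Event.class });
        return conf;
    }
}
```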
We'll upload separate posts about these and others on our blog:
http://labs.totango.com/
• Check out our blog: http://labs.totango.com/
• We're hiring! http://www.totango.com/jobs/
 – Backend / Big Data Engineers
 – DevOps
 – Application / FrontEnd
• Stay in touch
 – [email protected]
 – https://il.linkedin.com/in/romik
Questions?