Apache Spark Usage in the Open Source Ecosystem
TRANSCRIPT
![Page 1: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/1.jpg)
Apache Spark Usage in the Open Source Ecosystem
Hossein Falaki @mhfalaki
![Page 2: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/2.jpg)
About me
• Software Engineer / part-time Data Scientist at Databricks
• I have been using Apache Spark since version 0.6
• Developed the first version of the Apache Spark CSV data source
• Worked on SparkR and R notebooks at Databricks
![Page 3: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/3.jpg)
Stack Overflow 2016 trending tech
![Page 4: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/4.jpg)
Apache Spark Philosophy
1. Unified engine: support end-to-end applications
2. High-level APIs: easy to use, rich optimizations
3. Integrate broadly: storage systems, libraries, etc.
SQL, Streaming, ML, Graph, …
![Page 5: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/5.jpg)
Databricks Community Edition
• In February, Databricks launched a free version of its cloud-based platform in beta
• Since then, more than 8,000 users have registered
• Users have created over 61,000 notebooks in different languages
• This is an analysis of the third-party libraries that our beta users imported to complement Apache Spark in Scala, Python, and R
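
The slides do not say how imports were extracted from the notebooks. As a minimal sketch for the Python case, assuming each notebook cell is available as a source string, the standard `ast` module can list the top-level packages a cell imports. The function name `top_level_imports`, the sample cell, and the choice to treat everything except `pyspark` as a complementary library are illustrative assumptions, not the method used in the talk.

```python
import ast

def top_level_imports(source):
    """Return the set of top-level package names imported by Python source."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" contributes "a"
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" contributes "a"; relative imports are skipped
            packages.add(node.module.split(".")[0])
    return packages

# Hypothetical notebook cell, used only for illustration
cell = """
import re
import numpy as np
from pyspark.sql import functions as F
from matplotlib import pyplot as plt
"""

imports = top_level_imports(cell)
# Assumption: anything other than Spark itself counts as a complementary library
external = {name for name in imports if name != "pyspark"}
print(sorted(external))
```

Parsing the syntax tree rather than matching text with a regular expression avoids counting the word "import" inside strings or comments, at the cost of failing on cells that are not valid Python.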
![Page 6: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/6.jpg)
What % of users use other libraries?

| Language | % users importing external libs | Average # libs | Median # libs |
|---|---|---|---|
| Python | 75% | 9 | 2 |
| Scala | 55% | 3 | 1 |
| R | 57% | 6 | 1 |
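
For readers who want to compute the same three figures on their own data, here is a small sketch. It assumes a mapping from user to the set of non-Spark libraries that user imported. The data below is made up, and the slide does not say whether the average and median are taken over all users or only over users who imported at least one library. This sketch assumes the latter.

```python
from statistics import mean, median

def library_stats(libs_by_user):
    """Return (share of users importing libraries, average count, median count).

    Assumption: the average and median are computed only over users who
    imported at least one library.
    """
    counts = [len(libs) for libs in libs_by_user.values() if libs]
    if not counts:
        return 0.0, 0.0, 0.0
    share = len(counts) / len(libs_by_user)
    return share, mean(counts), median(counts)

# Hypothetical data, for illustration only
libs_by_user = {
    "user_a": {"re", "numpy", "pandas", "matplotlib"},
    "user_b": {"json"},
    "user_c": set(),
    "user_d": {"os", "math"},
}

share, average, middle = library_stats(libs_by_user)
print(f"{share:.0%} of users import libraries; average {average:.1f}, median {middle}")
```

An average well above the median, as in the table for all three languages, points to a skewed distribution where a few users import many libraries.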
![Page 7: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/7.jpg)
Installing libraries is easy
![Page 8: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/8.jpg)
Python Packages
![Page 9: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/9.jpg)
Most popular Python packages
![Page 10: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/10.jpg)
What is test_helper?
![Page 11: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/11.jpg)
What are these?
ETL
• re
• datetime
• pandas
• json
• csv
• string
• math / operator
• urllib / urllib2

Visualization
• matplotlib
• ggplot
• seaborn

Advanced analytics
• numpy
• sklearn
• graphframes
• tensorflow
• scipy

Other
• test_helper
• os
• md5
![Page 12: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/12.jpg)
Python package categories
![Page 13: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/13.jpg)
What packages go together?
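
The chart for this slide is not reproduced in the transcript, and the talk does not state how the grouping was computed. One simple way to ask which packages go together, offered here only as an assumed approach, is to count how often each pair of packages appears in the same notebook. The function name `pair_counts` and the sample notebooks are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_counts(library_sets):
    """Count how many notebooks contain each unordered pair of libraries."""
    counts = Counter()
    for libs in library_sets:
        # Sorting makes ("a", "b") and ("b", "a") map to the same key
        for pair in combinations(sorted(libs), 2):
            counts[pair] += 1
    return counts

# Hypothetical per-notebook import sets, for illustration only
notebooks = [
    {"numpy", "pandas", "matplotlib"},
    {"numpy", "matplotlib"},
    {"re", "json"},
]

counts = pair_counts(notebooks)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Raw pair counts favor packages that are popular overall, so a real analysis would likely normalize them, for example by the frequency of each package on its own.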
![Page 14: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/14.jpg)
Scala Packages
![Page 15: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/15.jpg)
Most popular Scala libraries
![Page 16: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/16.jpg)
What are these?
ETL
• java/scala util
• scala.collection
• scala.math
• java.{io, nio}
• java.text
• o.a.commons
• kafka
• twitter4j

Visualization
• ?

Advanced analytics
• spark.ml
• graphframes

Other
• java.net
• scala.sys
![Page 17: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/17.jpg)
Scala package categories
![Page 18: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/18.jpg)
What libraries go together?
![Page 19: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/19.jpg)
R Packages
![Page 20: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/20.jpg)
Most popular R packages
![Page 21: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/21.jpg)
What are these?
ETL
• dplyr
• plyr
• reshape2
• jsonlite
• tidyr
• lubridate
• httr
• data.table

Visualization
• ggplot2
• beanplot
• plotly
• ...

Advanced analytics
• sparkr
• h2o
• caret
• e1071

Other
• devtools
• magrittr
![Page 22: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/22.jpg)
R package categories
![Page 23: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/23.jpg)
Comparing Python, Scala & R
![Page 24: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/24.jpg)
Languages have unique features
Scala / Python / R | R / Python | Scala / Python / R
• 25% of users use multiple languages
• 3% of notebooks mix different languages
![Page 25: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/25.jpg)
Summary
• Spark users extensively mix it with other packages in different languages
  – One of the goals of the Spark project is to work well with other projects
• ETL-related libraries are the most popular category
  – Opportunities for new data sources
• Notebooks are being used for “small data” as well as “big data”
• Languages and their ecosystems have diverse capabilities. Users seem to be mixing languages to their advantage
  – Scala is missing visualization libraries
![Page 26: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/26.jpg)
Try your favorite library in Databricks
http://databricks.com/ce
Try the latest version of Apache Spark and a preview of Spark 2.0
![Page 27: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/27.jpg)
Thank you!
![Page 28: Apache Spark Usage in the Open Source Ecosystem](https://reader034.vdocuments.site/reader034/viewer/2022042611/58f0f5421a28abc5258b459f/html5/thumbnails/28.jpg)
What packages are used together?