SDS Podcast Episode 211 with Javier Luraschi · 2018-11-22
TRANSCRIPT
Kirill Eremenko: This is episode number 211 with software engineer at
RStudio Javier Luraschi.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name
is Kirill Eremenko, Data Science Coach and lifestyle
entrepreneur, and each week we bring you inspiring
people and ideas to help you build your successful
career in Data Science. Thanks for being here today
and now let's make the complex simple.
Kirill Eremenko: Welcome back to the SuperDataScience podcast.
Ladies and gentlemen, I'm very excited to have you on
the show today and today I've got a very interesting
guest Javier Luraschi on the show with us. Javier is a
Software Engineer at RStudio. If you use RStudio, you
might know a couple of packages that he's worked on
and/or co-authored, such as sparklyr,
mlflow, cloudml, and many more. And in fact, even if
you're just a beginner in R, you probably have already
encountered the dark theme in RStudio, and that's
something that Javier has also contributed towards. So
we've got a very exciting podcast coming up for you
just now with Javier. We talked a lot about Big Data
and Big Compute, and most of this podcast is
dedicated to Apache Spark.
Kirill Eremenko: So in this podcast you will find out the whole
history of Big Data: where Hadoop came from, why
Apache Spark was created, how they compare, what
Apache Spark is used for, how it's developed over
time, and how it's developing now. You will also learn
a lot about package development in RStudio and some
of the exciting things that are happening in this space.
And in addition to all that, you will feel a lot of
passion. Javier has a ton of passion for this space, for
RStudio, for Apache Spark, for developing packages,
for Big Data and Big Compute. So this podcast is full
of that. I was personally sitting on the edge of my seat
just enthralled by all these amazing stories that Javier
is telling and this whole space of Big Data.
Kirill Eremenko: So all in all, an incredibly exciting, captivating
podcast. Can't wait for you to check it out. So without
further ado, I bring to you Javier Luraschi, who is a
Software Engineer at RStudio.
Kirill Eremenko: Welcome, ladies and gentlemen, to this
SuperDataScience podcast. Today I have a very special
guest calling in all the way from Seattle, Javier
Luraschi from RStudio. Javier, how are you going
today?
Javier Luraschi: I'm doing good Kirill. How about you?
Kirill Eremenko: I'm doing well as well and I'm in Australia right now,
on the Gold Coast. The weather is quite warm here, unlike
Seattle. You said it's getting a bit cold there?
Javier Luraschi: Yeah, I'll probably have to go and visit you one of these
days 'cause definitely we can use some of your sun.
Kirill Eremenko: Yeah, for sure. And you mentioned winters are quite
harsh in Seattle, like how cold does it get?
Javier Luraschi: It's not that it gets that cold, it gets to about minus
five Celsius, which is not terrible, but we have very,
very long winters. So definitely expect a visit from me
somewhere around March or May. I'm assuming
Australia has pretty nice weather most of the year.
Kirill Eremenko: Yeah. Most of the year. It's quite good. The only thing
is that it's a bit like first time I got here I was not really
expecting that during winter it's warm during the day,
but then at night the temperature drops to like maybe
plus eight degrees Celsius, which is fine. But the thing
is they don't have central heating in the buildings. So
the building is actually cold and you have to get
blankets. So that was a bit unexpected. But other than
that it's really cool.
Javier Luraschi: I would say that at least in winter it's a time where you
can be very productive because you know, like it's
dark and everything and it's cold. So it's kind of like
puts you in more to just get things done which is good.
I've kind of been in warm countries and sometimes
getting work done in warm countries gets trickier.
Kirill Eremenko: I know. You work faster when it's colder.
Javier Luraschi: I believe that's for me, but I don't know if that's the
same for everyone.
Kirill Eremenko: Same for me. All right. Well Javier RStudio. You're on
the podcast. It's really cool to have you. As I
mentioned before the podcast, I've already spoken to
Nathan Stephens and Nisha Iyer. I've recently
just been chatting on Linkedin to Jonathan Regenstein
all from RStudio. You have such a fantastic team
there. This is incredible. Every person I met from
RStudio is like some sort of very interesting genius in
their own field and I learned so much. I'm really
looking forward to this podcast. I got really high
expectations for today.
Javier Luraschi: Well we appreciate that, but like it's definitely fun. I
don't know about all of us being geniuses, but definitely
it's fun. Especially since we're a distributed company, I
don't even know honestly where Jonathan is these
days, but whoever you talk with at RStudio it's likely that
they won't be in the same city 'cause we're just all
over the US and also Europe and yeah.
Kirill Eremenko: It's crazy. How is it to work in a distributed company
like that?
Javier Luraschi: I really love it honestly. I do feel like it's different. I feel
like any other new job or any change in your life
usually has a honeymoon period, right? Where you ...
Even Data Science, right? I bet you start with Data
Science and you're like, oh my God, this is great. This
is the best thing. For the most part. I think working in
a remote team has been that way. I do think, and I
would give this advice, or unsolicited advice, to your
listeners, right? There's definitely like a period where
you have to tweak your personal life to also
maximize all of the the things that are not work
related, if you know what I mean, right. So like when
when you're in the same office it's very easy to have
those personal interactions and keep up with people
and what not and when you're in a distributed team
you kind of lose the personal face to face connection
unless you proactively say, hey, I want to catch up with
these colleagues or friends that I haven't seen right?
Javier Luraschi: But net net, I think it's fantastic. I don't even know
what a commute looks like anymore, and it's nice to be
close to your family and what not, so I definitely highly
recommend positions in remote teams.
Kirill Eremenko: That's so cool. That's so cool. And I totally agree with
the whole concept: losing out on the personal side of
things is something you need to take care of. Like for
instance, at SuperDataScience, we have a remote team
and we all catch up, we make an effort to at least once
a year to all get into the same place and spend a week
together. And the other thing is like every week we
have a new buddy. So everybody in the team randomly
gets assigned somebody else so it could be like a
director, it could be an analyst, and for the
week, no, not for the week, for the month, you guys are
buddies and so you plan your weeks together, you
catch up once a week and that really puts people from
different parts of their company closer to each other.
So I think you're right, the personal side of things.
Javier Luraschi: Yeah. So I'm actually curious like what does that
mean? 'Cause you're still remote, right? Does that
mean that you get to talk with this person in Slack or
whatever you're using or?
Kirill Eremenko: So Slack, you can talk to anybody anytime, but that
means that at least once a week you need to catch up
and talk to that other person in person. So we use
Zoom, so not like in person as in-
Javier Luraschi: Yeah, we Zoom as well.
Kirill Eremenko: But like on video you need to catch up with that
person at least once a week at the start of the week
when you're planning your week and you spend an
hour together, and so you're already chatting about
the weekend, what are you doing in your personal life?
And feedback's been really good. People love to find out
more about that.
Javier Luraschi: Yeah connecting with other colleagues and what not.
That makes a lot of sense.
Kirill Eremenko: That's cool. So tell us, how did you get into RStudio in
the first place? Before we continue, I just wanted to let
everybody know. So Javier is a software engineer at
RStudio. Maybe let's start there. Before we go into the
story of how you got into RStudio, tell us a bit about
what exactly do you do at RStudio?
Javier Luraschi: Sure. Well, I mean, as you mentioned, I'm a software
engineer, so I write software, right? That's kind of like
obvious. But I mostly have been focusing on R
Packages. For those of you that might not be super
familiar with what an R Package is, it's basically R
code, which is just code in the R programming language, right?
And you package this code into modules that you can
share with other people, right? So this is actually, it's
a concept that you can find in other programming
languages, but I think what the R community has
going is pretty special. 'Cause there's a very nice
relationship between an R Package and an actual person
that lives and breathes in real life, right? So a lot of the
functionality that you use from R is going to come
from packages and those packages have maintainers
and authors. And at the end it's just R code, but it's
packaged in such a way that it makes it very easy for
you to reuse and for us maintainers to also keep up
with new releases and new features and why not.
Javier Luraschi: So yeah I do R Packages and specifically I do packages
that are mostly related with making R run faster or
at scale, or helping you share models with other people.
So R Packages can basically do anything you want and
the ones that I focus on are mostly about helping you
work with Big Data and Big Compute and specifically
with a package called sparklyr.
Kirill Eremenko: Interesting. That's a very interesting description. I
thought that R Packages mostly come from CRAN and
that they are created by the community. I didn't know
that RStudio purposely creates R packages. Can
you tell us the difference between those two please?
Javier Luraschi: Well, they're exactly the same. They're just two
different parts of the story, right. So basically an R
Package is literally like you could see it as a box
with code inside it, and CRAN is the store, all the
packages are free, right. But it's where you go and get
the actual packages. So there's people that build these
packages in the R community, they don't need to
work at RStudio. And honestly most of the packages
that are on CRAN are not from RStudio at all. So
anyone can package a useful piece of code into an R
Package and then make it available on CRAN. And
then anyone from the community can go to CRAN
search for packages and then install them, right. So
think about CRAN as the APP store, right? Whenever
you go to the APP store, you look through apps and
then you install them. CRAN is kind of like the APP
store from the R world. And some of the packages, very
few, are developed by RStudio. And many, many, many
others are developed by other people in the
community.
Kirill Eremenko: Okay. Gotcha. And in that case, how does RStudio
decide what the community is going to develop and
what is going to be developed within RStudio, like for
instance, within your team and just like within the
company? Is there kind of some strategy behind that?
Javier Luraschi: Yeah, honestly, I don't think
there's an explicit strategy, like this is my own
personal point of view. So I wouldn't speak completely
on behalf of RStudio. But the way that I see it is that if
we see someone that is already working in a package
and it's a great package in the community, we just
simply don't work on it, right? We're very pragmatic in
the way we try to approach problems. So we try to look
for the things that are the most painful and the most
impactful that the community might need. And if we
look at a problem, right? Like in my case, like, hey, Big
Data, right? Is someone in the community already
providing that? And if the answer is no and there's a
big enough impact that we can help the community,
we basically help.
Javier Luraschi: That's kind of the very high level, like, hey, this is kind
of how we see it. 'Cause yeah, I think it could be
possible that if the R community would take over all
the packages, there wouldn't be a need for RStudio
perhaps to develop packages, right. But something
that perhaps, I don't know if you already talked about
this on the series, but it is like the opportunity that
exists in Data Science in general, and in my particular
case, Big Data is so huge that, we need more people,
right. So it's not that we're fighting for packages which
is, I think it's a great place to be at. It's mostly like, oh
my God, Data Science is growing so much and there's
so much to be done that, let's just make sure that we
don't overlap because that's just inefficient. But
there's so much opportunity everywhere that we
haven't had that problem of hey, having to do a lot of
coordination or why not. It's mostly pretty obvious
when there's a gap or an opportunity that someone
can help with.
Kirill Eremenko: Gotcha. Okay. Very interesting role and lots of
packages. So maybe let's talk a little bit about some of
the recent development. So you mentioned Big Data.
Can you tell us a bit about Big Data, maybe even give
a quick overview for new listeners on the podcast who
are not familiar with the concept of Big Data. What is Big
Data and Big Compute? What's the story that's
happening there?
Javier Luraschi: For sure. Honestly, I really love those two concepts,
Big Data and Big Compute and mostly I really like
them 'cause there's people that have very strong
feelings one way or the other one and you probably
have seen these around. So kind of to me the way that
I like to introduce them, it's just from a historical point
of view. Maybe that's the most boring point of view,
but I think it's also pretty exciting, right? So if you
think, like to me one of the most exciting things about
in general Data Science and Big Data is, if you ask a
historian, right? Someone specializing in history,
hey, how do you divide human development,
right? The development of human civilization? It's
going to be pretty likely that they will mention things
like the stone age, pre industrial revolution, the
industrial revolution and why not.
Javier Luraschi: And a lot of them will mention also the information
age, right? I'm pretty sure that you might have already
heard this term, right? The information age and this
kind of concept basically means that there's a lot of
information that we're creating that is digital, right?
And if you look at one report from the World Bank,
which basically creates reports about how the world is
changing every year. One report that they had in 2016
kind of like tried to analyze how much physical data we
have that's called analog data in some ways, like books
and paper and anything that leaves a paper trail and
why not. And also digital information, right. And you
can see in the report it's just growing at exponential
rates. We don't have to explain this too
much, but if you look at how many cat videos we
create per day, right? Or Instagram photos or
Facebook photos or just data in general.
Javier Luraschi: We're creating a lot of information. So back in the year
2003, around that time, there were companies like
Google that were just starting, right? It's crazy, but
Google is a pretty new company, right? It wasn't here,
what is that, 20 years ago or why not? And what
Google was trying to do is say, hey, we have the
worldwide web and there's a lot of information that
seems to be usable. Can we make it searchable? And
there were obviously companies before them like
Yahoo and Excite and AltaVista for those of you that
remember, but Google was one of those companies
that said, how can we make search better?
Javier Luraschi: And the first problem that you hit when you're talking
about the web is, you can't really put the whole
web in one computer, right? It's just too much data and if
you try to put it in one computer, like it just doesn't
make any sense. So the way that they solved the
problem back then was by using multiple computers,
right? You have multiple computers, each with their own
hard disk, and across many computers you can load the
entire web and make searches, and that's basically
what gave birth to what's called HDFS, which I
believe has been maybe mentioned already in your
podcast. Which is basically a way of splitting data into
multiple machines. Right. So that's kind of like the
beginning of Big Data is like, hey, if you have
something that doesn't fit in one machine, you have
Big Data and it's a pretty clear definition. There's
obvious problems that fall into this category, right?
Like, hey I have my computer, I'm doing analyses, I
can't have all this data in my machine, I need multiple
computers. Well now we're talking about Big Data.
Kirill Eremenko: And HDFS is the Hadoop Distributed File System?
Javier Luraschi: Yeah, it's basically the Hadoop Distributed File
System. Basically the way that this started is Google
had a research paper where they explained to the
world, hey, we have the Google file system, right? And
we have a bunch of files, this is how we distribute
them, and then some engineers in India said,
well, this sounds like a great idea, can we make it an
open source project? And that's where Hadoop came
from. It's this open source implementation of the
internal file system that Google made available through a
research publication, right. And let me know how this-
Kirill Eremenko: Sounds great [crosstalk 00:18:44].
Javier Luraschi: Yes, it's great okay. I want to make sure that I give
enough information but not too much.
Kirill Eremenko: No. I love it. You're totally right. The historic
context makes it even more exciting. Makes it like a
story, like a journey. So yeah, no, I'm just listening.
I'm really immersed in your explanation, please
continue.
Javier Luraschi: Well, yeah for sure. So far so good, right? Google was
doing their own thing and we had open source projects
like Hadoop that were doing their own thing, and as I
mentioned, they were mostly based on disk drives
right? Disk drives even today can store the most
information for the lowest cost, so it makes sense to
put information there. But then a different project
came out around, I want to make sure I get this right,
but I believe it was around 2009 there was a project in
Berkeley to kind of improve over Hadoop. They said, well,
Hadoop is great, we're doing all these things based on a
file system based on disk drives, but can we do it
better? Can we do it faster? And sure enough, another
open source project started in Berkeley called Apache
Spark. And the premise at that point in time, or
one of the things that changed from its precursor,
was that there was a trend in which RAM,
computer memory, was getting cheaper.
Javier Luraschi: Not as cheap as disk drives, but it was getting a bit
cheaper, right. And it also happens to be the case that
your computer memory makes your computer pretty
fast, right? So whenever you go and buy a new
computer, the amount of memory is one of the things
that you want to go and check, oh, it has like four
gigabytes or eight gigabytes or 32 or whatever. It
makes a big difference on speed 'cause it means that a
computer can handle more things in a faster and
easier way, right. Anyways, so basically
the Apache Spark project started by figuring out how
to create something like Hadoop and improve over it
based on memory.
Javier Luraschi: And what they found out is that sure enough you can
get significant speedups by using memory instead of
disks. And one of the things that we like to do as
software engineers is sort data, sounds like such an
easy task, just to order data. If you had a list of names
of people and you just wanted to list them in alphabetical
order, it happens to be a pretty compute intensive
task, especially when you have a lot of data. Anyway,
so what they found out back then is what it used to
take to sort 100 terabytes, which is a lot of data: it
used to take Hadoop 72 minutes and 2100 computers.
So, that was with Hadoop. And then with this new
framework based on memory, and with a richer
vocabulary of operations that you could do, you
could do the same in 23 minutes but only using 200
computers.
Javier Luraschi: So that's crazy, right? It's an improvement of 10X
performance, you need 10 times fewer computers and
you actually make it ... You can sort the information
faster. So it was a really, really big deal back then. And
in some ways it's still part of the main trend of trying
to figure out as a society and as humanity, how do we
order, how do we make sense of all this information,
right? How do we arrange it? How do we store it? How
do we answer harder and more interesting questions
over and over? So, that's pretty much what Spark is.
It's an in-memory engine that allows you to process
any information. One of the things that is different
perhaps with Hadoop is that with
Hadoop, there was only support for one type of
operation, called MapReduce. And maybe this is also
something that people have heard around when
they're getting immersed into Data Science and Big
Data, right? It's like MapReduce.
Javier Luraschi: Well, back when Hadoop was designed, there was
another paper from Google where they proposed this
model. Whenever you want to transform information, I
guess I should say that so far we've talked about
storing information and retrieving it. We haven't talked
enough about, okay, well, information is across multiple
computers, how do you actually make it do something
meaningful if it's distributed across multiple
computers, right. So the first attempt to solve that,
which also was at around the time a couple of years
after Hadoop came to be, was another paper from
Google called MapReduce, which basically
explained to the world, hey, if you have a distributed
system, that basically means you have a lot of
computers, you can reduce all the operations to
mapping, meaning transforming some information in
the same machine to some other information in the
same machine, and then combining the information
between machines.
Javier Luraschi: That's kind of like at a high level what it is. And it was
pretty good at that time. But it was also constraining
in the sense that it didn't provide a lot of verbs or kind
of like grammatical constructs to make coding
easier, right? It was pretty bare bones at the time. And
one of the other big improvements from Spark is that it
enabled a big vocabulary of operations to make Big Data
easier to analyze, right? So instead of just saying,
hey, you need to tell me how to map data on each
machine on then aggregate it, it introduced things like,
hey, just tell me what you want filtered, what you
want averaged. Kind of like things
that now feel closer to data analysis.
Like, hey, I just want to count how many earthquakes
are available in these data sets. That's a pretty simple
question that used to be actually pretty tricky to ask
back when we only had MapReduce.
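As an aside, the map-then-combine model described above can be sketched on a single machine as an analogy. The snippet below is a hypothetical word count in base R, not actual Hadoop code: the map phase turns each document into (word, 1) pairs, and the reduce phase sums those pairs across "machines".

```r
# Hypothetical single-machine analogy for MapReduce (not actual Hadoop code).
docs <- c("spark is fast", "hadoop is on disk", "spark is in memory")

# Map phase: each "machine" turns its document into (word, 1) pairs.
mapped <- lapply(docs, function(doc) {
  words <- strsplit(doc, " ")[[1]]
  setNames(rep(1, length(words)), words)
})

# Reduce phase: combine pairs across "machines" by summing counts per word.
pairs <- unlist(mapped)
counts <- tapply(pairs, names(pairs), sum)
counts["is"]  # "is" appears once in each of the three documents, so 3
```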
Javier Luraschi: So with Spark you can say things like, hey, from all
these data set, give me all the earthquakes that just
happened in Australia on this period and give me the
count. So that's a much richer way of expressing
and analyzing data when it's distributed across
multiple machines that previous technologies didn't
really ... It wasn't as simple. You had to provide code
to process each disk drive on each machine
and then figure out how to aggregate and all that. So
yeah it was a big deal and it's still a big deal. I mean, it
has been less than 10 years. So if you look at the
Apache Spark project, I mean, we can talk about more
of the things that have happened and why not. It's still
changing a lot. So it's by no means what I would
consider it done. And it's growing everyday.
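The earthquake example might look roughly like this with sparklyr and dplyr; the quakes data set and its columns are made up for illustration, and `master = "local"` assumes a local test installation of Spark.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# A made-up earthquake data set copied into Spark for illustration.
quakes_df <- data.frame(
  country   = c("Australia", "Japan", "Australia", "Chile"),
  magnitude = c(4.1, 5.9, 3.2, 6.4)
)
quakes_tbl <- copy_to(sc, quakes_df, "quakes")

# "Give me all the earthquakes in Australia and give me the count."
quakes_tbl %>%
  filter(country == "Australia") %>%
  count()

spark_disconnect(sc)
```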
Kirill Eremenko: Gotcha. So if I understand correctly, Apache Spark not
only is faster than Hadoop in the sense that it works
in memory, uses RAM instead of disk, but it
also is simpler to use and operate. Is that correct?
Javier Luraschi: Yeah. That is completely correct. And the thing to say
here, it sounds really easy and it is really easy
compared to dealing with MapReduce. But you're still
talking about hundreds of machines or at least tens of
machines. So it's a pretty hard problem and it's
surprising the amount of progress that the open
source community has made in the past 20 years.
Where now really anyone, if we want to get into the
specifics of, hey, how do I get into this? Really anyone
can download the tools that they need in less than 15
minutes and get up and running and start doing data
processing. That was really state of the art in
companies like Google or Yahoo like 15 years ago. So
that is just kind of like the technical explanation
of why Data Science and Big Data are such a big deal.
Javier Luraschi: It is, 'cause these tools are so much easier today than
they were before that we honestly haven't figured out
yet how to fully apply them and we can maybe talk a
little bit about this. But just talking about potential.
It's like, well Google and Yahoo and other big
companies were using these tools and they were great
and now they just became so easy to use. What can we
do with them? And I'm sure a lot of your listeners will
have more particular ideas, but it's definitely
fascinating.
Kirill Eremenko: Yeah for sure. And do you have certain packages in R
which allow people to work with Apache Spark?
Javier Luraschi: Yeah, that's a great question. And the answer is yes.
And the package that I recommend using, it's called
sparklyr. It's kind of like a corny name made out of Spark:
you have Spark, and this is sparklyr, or sparkly R.
Yeah, so we try to name our packages in fun ways
'cause it doesn't make sense to name them with boring
topics. So yeah, it's called sparklyr and it's basically
the way that you use Spark from the R programming
language, right. So R is obviously a pretty well known
programming language or computing language I would
want to say-
Kirill Eremenko: Very well known. My brother is studying at uni now
and he's actually using R. It's really going well.
Javier Luraschi: There you go. So if you're already learning R, it's
a natural progression to say well,
what about things like Big Data? And we haven't even
talked about Big Compute so we need to get back to
that. But yeah, definitely if you want to get involved
into doing things in cluster computing with Big Data
and why not, this R Package called sparklyr is a very
nice way of getting started. And again in this
particular case, it's a package that RStudio developed,
myself and other people at RStudio are authors of
this package. But it's actually an Apache licensed
open source project. So it's completely free. It's
available in CRAN, you can download it, it's easy to
install. And same for Spark. I didn't mention that, but
Spark also happens to be an open source project, it's
Apache licensed. So it's pretty much there for anyone
to use that has that need or interest.
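Getting started really is close to the "15 minutes" mentioned above; a minimal sketch, assuming a local machine with Java available for Spark:

```r
install.packages("sparklyr")    # sparklyr itself comes from CRAN
library(sparklyr)

spark_install()                 # downloads a local Spark distribution
sc <- spark_connect(master = "local")  # connect to that local Spark

spark_version(sc)               # check which Spark version is running
spark_disconnect(sc)
```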
Kirill Eremenko: That's very cool. That's very cool. How long does it take
you guys to do all the packages? Just out of curiosity?
Javier Luraschi: Well in all honesty, I feel like we're still working on it
and we're going to be working on it for a while. So
the original package, we worked for a few months on it
back in 2016. We actually announced it at useR!
2016. It had the basic functionality which
included things like being able to do data analysis with
dplyr, which is one of the most used packages in the
R community and allows you to basically express
data manipulation operations through a grammar.
That makes it very easy to analyze everything from
very simple things like, hey, give me the average of this
column, we all used to do that in Excel or why not, or
counting records, all the way to, well, I want to
join these two data sets and I want to
run a specific computation on them and why not. So it
definitely has that breadth of functionality for very
basic data analyses to very complex data analysis.
Javier Luraschi: So that was one of the first features that we decided to
include in sparklyr, to allow the community to easily
move their already existing analyses, basically analyses
that they can run on their own machine. Like if you
have a CSV file, just a text file or why not, or an Excel
file ... By the way, you can use Excel with R if that's
what you like doing. So what dplyr allows you to do is
to say hey, I'm going to use an Excel spreadsheet to do
analysis, so you can import it with readr and then you
can do data analysis with dplyr. But now, with
sparklyr and its support for dplyr, you
can say, well, instead of running this same analysis on
this Excel spreadsheet file, now I want to run it on like
10 terabytes of data or why not?
Javier Luraschi: It's the same thing. Which is kind of crazy. You don't
need to rewrite it. You don't need to worry about a lot
of those things. You do need to pay for the computers, right?
We haven't talked about this but like still computing is
not free, right? So someone needs to pay for those
computers and what not. But at least as a user and if
you work in a company or you aspire to work in an
organization that is going to do data analysis at
small scale and also at big scale, these tools allow you
to really easily jump from, hey, I just want to do
something quick and dirty while I'm on the bus, on my
way to work, and it works, it's great. And then like you
can just literally copy paste the same code and run it
against like a ginormous data set.
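That "copy paste the same code" point can be illustrated with R's built-in mtcars data set: the same dplyr pipeline runs locally and, unchanged, against a Spark table (a sketch, assuming a local Spark install).

```r
library(sparklyr)
library(dplyr)

# Locally, against an in-memory data frame:
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# The same pipeline against Spark: copy the data in and reuse the code.
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))  # executed by Spark, not by R

spark_disconnect(sc)
```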
Javier Luraschi: Yeah, so that's kind of like how it started back in
2016. But since then, what we've seen is that the
Spark community keeps growing. They keep adding
new features-
Kirill Eremenko: They have to keep up.
Javier Luraschi: It keeps getting more and more interesting. Like for
instance, one of their conferences got renamed last
year I believe, from Spark Summit to Spark + AI
Summit. So like now Spark is getting also intertwined
with deep learning and it's enabling a bunch of other
really cool interesting things to do. So I honestly don't
think that we're going to be able to call our work done,
at least not in the next few months, right. I don't know
if it's one year, two years or another 10 years, who
knows? But definitely it's like a moving target. And I
think that's exciting, right?
Javier Luraschi: 'Cause that means that you're not jumping into
something that started like 10 years ago that no one
uses anymore, right? It's like the opposite,
right? There's things that are very stable that you can
rely on and use at scale, and we are
very confident that they're going to give you the
productivity results that you're looking for. And there
are other things that are just very, very new.
And they're exciting and they're probably going to get
there, but they're still like a moving target. So there's
definitely like interesting bits and pieces for
newcomers and for the experts to be excited about,
specifically about Spark just 'cause we're talking about
Apache Spark, but also I think in general in Data
Science.
Kirill Eremenko: Gotcha. And while we're on the topic of Spark, I know we
have so many other things we can discuss, but I'm
really curious, Apache Spark 2.0 was released I think
last year, at the beginning of last year. What are your
thoughts on that jump from Spark to Spark 2.0, what
new stuff was added?
Javier Luraschi: Yeah. For sure. Well first of all I would say that that
was a big jump. Just to give some context, Spark 1.5
was when the first wave of people started getting
familiar with it, and I feel like it really hit mainstream.
1.6 is still one of the most widely used versions
today, and the jump from 1.6 to 2.0 introduced many
improvements and interesting new features. For
instance, one of the ones that I was excited about
was Spark Structured Streaming. I know it's a pretty
long name, but it basically means real-time data
processing. And I feel like that's a really good
segue to talk a little bit about Big Compute, right?
'cause we've only talked about the Big Data part of
the story, which is important and relevant. But then a
lot of companies don't have huge data sets, right?
Javier Luraschi: So what's the point of Big Data? Why do I need Spark?
Is this just hype and whatever? And you know like
some of that is true, right? Not everyone needs to Big
Data, but there's this whole other side of the coin
called Big Compute. And what Big Compute the way
that I like to explain it is just mostly making things
faster, right? So like when you have simple questions
like, hey, count how many records I have. Well that's
usually pretty fast and you don't have to worry about
that, right? But when you start asking questions
like, hey, could you please sort this whole data set?
Well, that's a little bit harder, right? And then for
those people that are on the track of doing Data
Science, and again, I'm not an expert on Data
Science, I'm a software engineer, but they will get
more and more familiar with more complex models, right?
Javier Luraschi: So, some people might already be familiar with linear
regressions, which are a type of model that is pretty
common and pretty efficient and like a pretty good first
step towards modeling. But then you can
incrementally get harder and harder models. You want
to really fit the data correctly and why not? So a lot of
times what happens is like, well maybe I only have 100
megabyte data sets. Like well, that's obviously not Big
Data, right? But then you're running these models and
a lot of times what we see is we see data scientists just
waiting like an hour, right? Or it's like, well, I'm going
to go for a coffee 'cause this thing is just running for
the next two hours. And it's like, well, you know, that
might be a good case for Big Compute, right? Which
means, it only really means saying, hey, let's divide
these tasks into multiple computers 'cause even
though you don't have Big Data, wouldn't it be nice if
instead of waiting two hours we're going to give you
the answer in like 20 seconds, right? And he's like,
well-
Kirill Eremenko: Then there is no time for coffee.
Javier Luraschi: Okay, don't tell about your boss, but you can do it
faster and still go for coffee, but anyways. So definitely
there's the other side of the coin where you say like,
well, I don't care about Big Data or I don't have a need
for Big Data, but I want to make things faster. And
when you get to that point, it's like, well, how fast is
fast enough? I mean, you know, for you and me, maybe
our data sets aren't time-sensitive and we're like,
well, if you give me the answer in like 10 seconds,
that's good enough. Who cares? That's fine. But
there's a lot of industries out there, like I'm just
thinking of stock trading, right? I mean if I tell my boss, hey,
you know, I'm going to give you the answer in 10
seconds. He's like what are you doing?
Javier Luraschi: That's really not going to help me. So there's a lot of
use cases where you want to have instant feedback,
right? And some of the ways that we describe this is
with concepts like real time. We say oh, we want to
process data in real time, right? Meaning I don't want
to wait for whatever is being processed to finish
processing. I need the data right now. And there is
definitely a niche there of who really needs real time
versus who doesn't. But Spark structured streaming
enables those types of really fast execution models
that are very useful in some cases, that make a lot of
sense. And the way that they are tackling it, and
that we're tackling it with sparklyr, specifically
streaming, is a very profound, interesting way. And
the credit goes to the Spark project for the way that
they define a stream.
Javier Luraschi: So we're talking about Stream and we haven't even
really defined like, hey, what is Stream? Right? Well,
the way that we define what a Stream is in Spark is
like a table, like Excel spreadsheet, but when you open
Excel spreadsheet, you have a limited amount of rows,
right? Like you open it and you have, hey, I have like
20 rows or whatever two million. But you have like a
set number of rows. The difference with streams is
that we consider them us data sets that have an
infinite amount of rows, right? It's not true that they
have an infinite amount. But if you're looking at the
stock market and you were seeing like what's the price
of the Nasdaq every second. And if you try to see that
as a table, well, it's a table, but assuming that the
Nasdaq doesn't crash and disappear from planet
earth, it looks like an infinite data set, right? Of
data that just keeps coming.
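To make the "infinite table" idea concrete, here is a minimal sketch of handling a stream from R with sparklyr's stream_* helpers. The directory names are placeholders, and exact reader options may vary by sparklyr version; this assumes Spark is installed locally and a connection can be made.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark was
# installed, e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Treat the "prices-in" folder as an unbounded table:
# every CSV file that lands there becomes new rows of
# the same stream.
prices <- stream_read_csv(sc, "prices-in")

# Ordinary dplyr verbs apply to the stream as it grows.
filtered <- prices %>%
  filter(price > 0)

# Continuously write results out as new data arrives.
stream_write_csv(filtered, "prices-out")
```

The point of the sketch is that the stream is queried with the same verbs as a static table; Spark handles the incremental execution underneath.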
Javier Luraschi: And it also have the quality that you want to process it
really, really fast. You don't want to wait of like, yeah,
I'll tell you what is a good prediction for the stock
market tomorrow. Right? It's like, well, I need it
instantly. So kind of like when you start looking at
those data sets, that data is coming really fast and it
never stops coming, structured streaming is like the
future that you kind of want to consider. So, that's one
of the newer features that I'm excited about. Doesn't
mean by any means that you need to get started with
Data Science and Big Data with streaming, right?
There's a million other ways to get started, but it's
definitely one of those features that it's pretty exciting.
I could talk here for hours, so just tell me where
you want to steer the conversation.
Kirill Eremenko: I'm glad we touched on Big Compute and that it's a
part of Spark 2.0. So Big Compute, as I understand it
in this sense, is the ... Or the predecessor of Big
Compute is just parallelized computing. That's
something I studied back at uni-
Javier Luraschi: Yeah it is-
Kirill Eremenko: About paralyzed processes.
Javier Luraschi: Yeah it is.
Kirill Eremenko: Okay. Gotcha, Gotcha. And Spark takes full advantage
and that sense that it's running in memory, right? Big
Compute happens in memory and not on disk drives.
Javier Luraschi: And we were talking about these verbs that exist on
Spark, that make it really easy to do operations, right?
Like, hey, I want to filter, I want to get the average, I
want to join these two data sets, right. Since you are
already familiar with parallel computing, you probably
also remember how hard it is, right? Or it used to be,
right? It's not trivial at all to say, well I have three
computers with small data set, now I want to do
calculation over these three things at the same time.
And it's like, well, it's actually pretty tricky.
Javier Luraschi: I wouldn't say necessarily that it's fully sold on Spark
'cause there's a lot of like, you know, if you're doing
genomic analyses are why not, right. You will probably
have to do customs things. But just talking specifically
about data analysis, that problem is well solved in
Spark. You want to have to worry about, hey, how do I
make these things run faster? You just explain it in
terms similar to what we were talking about in R,
where we have the dplyr package. You can arrange a
pretty big data set, like this hundred terabyte data
set that we were talking about. You can still sort
that data set in 23 minutes just by saying sort: like
sort, open parentheses, close parentheses, with a
pipe before that, or what not.
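As a small illustration of those verbs, here is a sketch using dplyr on an ordinary data frame; with sparklyr, essentially the same pipeline runs against a Spark table (for example one created with copy_to), which is what makes the scaling transparent. The column names and values are made up for the example.

```r
library(dplyr)

# A toy data set standing in for something much larger.
flights <- data.frame(
  airline = c("AA", "UA", "AA", "DL"),
  delay   = c(12, 45, 3, 27)
)

# filter, summarize, and the "sort" (arrange) mentioned
# above, chained together with the pipe:
result <- flights %>%
  filter(delay > 5) %>%
  group_by(airline) %>%
  summarise(avg_delay = mean(delay)) %>%
  arrange(desc(avg_delay))

# result puts UA (45) first, then DL (27), then AA (12).
```

On a Spark connection, dplyr translates this pipeline to Spark SQL instead of executing it locally, so the same code scales to the cluster.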
Javier Luraschi: So, kind of like those things get ... Yeah and you're
right. Big Compute is not a new term. It's just
something that has been getting simpler and simpler.
And hopefully with time it will get even simpler. I
don't even know what that would look like 'cause I
already think it's pretty simple. Honestly, we could
also talk about that, 'cause I feel like one of the
challenges today in distributed computing is
troubleshooting when things are running at scale. In
a lot of cases
things will run smoothly and you can run your
computation and be done with it. The reality today is
that, the harder computations you're doing, even
though it's easier to express what to do, there is still a
lot of like, hey, I need to troubleshoot. Why is this
machine failing? Does it have enough memory? My
hope is that in the future those things will get even
more and more automated.
Javier Luraschi: I'm just making stuff up at this point. This is almost
science fiction. But like it would be cool that before
you execute some data analysis and you say like, hey,
I want to sort this data set. It would be cool if the
tool, in this case Spark or sparklyr or what not,
would tell you, hey, if you want to run it on one
computer, it's
going to take you three hours. If you spin up 10
computers with eight gigabytes of RAM each or
whatever, it's going to take this much.
Javier Luraschi: That part today doesn't exist yet. You need to figure
out like how big the cluster needs to be or how small
or get some advice from your system
administrator if you work in a big company or a big
organization. But hopefully one cool thing that we
could do is just make it easier to say hey, we're going
to help you optimize what you need and we're going to
tell you. And then you can execute or why not. But
yeah, there's definitely like a lot of really interesting
work if other ... If a lot of people listening to your
podcast feel more on the software engineering track, I
would also encourage them to explore kind of like those
areas.
Javier Luraschi: A lot of the questions that we get or I personally get
today to answer is I'm a software engineer, how do I
become a data scientist? And that's totally fine for
those people that want to become data scientists. But
there's also a lot of disciplines around data scientists
where people can apply their skills-
Kirill Eremenko: Like you. You're a software engineer working on Data
Science stuff all the time.
Javier Luraschi: Right. And I love software engineering and I wouldn't
change it, but it's surprising that I still find a lot of
very meaningful problems and interesting challenges
on Data Science without being per se a data scientists,
right?
Kirill Eremenko: Yeah. Gotcha. That's definitely a very interesting-
Javier Luraschi: Point of view I guess.
Kirill Eremenko: Yeah and career path that you've decided for yourself.
Javier Luraschi: I mean honestly I haven't thought about other career
paths, but I could see how someone that is doing
marketing, or what not, could also take that focus. I
don't know exactly what that would look like. But you
could say, you know what, I like marketing, but I
want to complement it with Data Science: how do I
really apply Big Data and Data Science to marketing?
So for those people that are curiously looking at
Data Science, I feel like there are also strong
career paths if you stay with whatever you're doing
but put a strong focus on Data Science or Big Data or
Big Compute. You stay where you are, and that's
probably just as valid.
Kirill Eremenko: Very, very true. Before we move away from Spark, I
just wanted to ask you quickly from your experience
and from what you've seen in this space, how difficult
is it to learn Spark? You mentioned that it's quite
simple to use, as in it parallelizes a lot of stuff for you.
You don't even have to think much about it, but in
general, what would you say, how long is the learning
curve from knowing some R programming and how to
do Data Science on a basic level in R to actually
knowing how to use Spark and querying Big Datasets
with Spark and with R. How long do you think that
would take?
Javier Luraschi: Well, so I would split this into two questions, right? So
if you're starting by your own and you don't have like
even a Spark cluster. If you're literally like finished like
if you're on your own, I feel like that could be
challenging 'cause it's like, well where do you even get
the computer? There's a lot of questions to ask and
how do you-
Kirill Eremenko: Amazon?
Javier Luraschi: Yeah, Amazon. So there's a lot of services out there.
There's Amazon, which has a service called EMR.
There's a service called Azure HDInsight, there's
Cloudera, there's Databricks. There's like 10 different
ones and I apologize for the ones that I didn't mention.
So there's definitely ways of getting started if you're on
your own. But I think that's usually not the case.
Usually what happens is, well there's ... We can break
it down again into two. One is like if you're learning, if
you're like, hey, I want to learn about R and I want to
learn about Spark. Well I have very good news for you
'cause it's actually super, super easy. In fact that's one
of the goals that I have that brings me the most joy is
to make it absolutely insanely easy for you to get
started with Spark and R. So you can get started very
easily. You download the sparklyr package by
installing it like any other R package, and then you
literally run spark_install, open parentheses, close
parentheses. It will install Spark for you, and then
you run spark_connect.
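Spelled out, those getting-started steps look roughly like this; sparklyr's function names are spark_install() and spark_connect(), and the first run downloads Spark, so it can take a few minutes.

```r
# Install sparklyr like any other R package, then let it
# install a local copy of Apache Spark for you.
install.packages("sparklyr")
library(sparklyr)

spark_install()                        # downloads and sets up Spark locally
sc <- spark_connect(master = "local")  # connect to the local instance

# Copy a small built-in data set into Spark and query it
# to confirm everything works.
cars_tbl <- dplyr::copy_to(sc, mtcars)
print(dplyr::count(cars_tbl))

spark_disconnect(sc)
```

The local master is only for learning and experimenting; the same code works against a real cluster by changing the connection string.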
Javier Luraschi: Don't try to do these from the actual podcast. But if
you want more information, go to spark.rstudio.com
and we'll help you get there. So it's totally real easy.
So, that's one case. If you're a student, definitely you
can do it and the barrier or the cost to enter is super,
super low. So give it a shot. The other way where you
might end up working with Apache Spark, which is
also really easy, is if you happen to end up working at
an organization that already has an Apache Spark
cluster, it is often the case that there's a cluster
administrator and there's someone whose job it is to
maintain that cluster or someone that is already
paying the bill with Databricks or Amazon or Google or
Azure, or what not, right?
Javier Luraschi: So if those clusters are already up and running and a
lot of times the data is already there, it also happens
to be very, very simple 'cause you don't have the
burden of setting that up, right? All you need to ask is,
hey, where's the cluster? What is the connection
name? If you listen to ... I'm almost sure that
Nathan probably talked about connection strings,
right? All you need is a character string that tells
you where the cluster is. And so you basically pass
that into spark_connect and that will get you up and
running.
And cluster administrators are pretty good at helping
other employees or members of that community to get
up and running. So I think in both cases it's pretty
easy to get started.
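For that second case, connecting to an existing cluster, the connection string is the only thing that changes. The URLs below are placeholders for whatever your cluster administrator gives you.

```r
library(sparklyr)

# Standalone Spark cluster, identified by a spark://
# connection string:
sc <- spark_connect(master = "spark://cluster-host:7077")

# Or, on a YARN-managed Hadoop cluster, the master is
# simply "yarn":
# sc <- spark_connect(master = "yarn")
```

From there, the analysis code is identical to the local case; only the master argument points at shared infrastructure.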
Javier Luraschi: So those are I think just great news. There's almost no
excuse to not try it out. I would say that it is a linear
... Well, I would like to say that it's a linear learning
path. So it is not true that doing everything on Spark
is as easy, if you know what I mean. As we were
mentioning, there are things that are very easy, and
newer things, like maybe Spark Structured Streaming
or other topics, that I work on practically every day
to make easier. But you want to start with small
steps which will get you very, very far regardless. But
then as you feel confident and feel proficient, you just
need to keep walking, right? It's just like a slope
with a [inaudible 00:51:38] climb. Kind of like
mountain climbing or what not, that type of deal
where, well, you start going, it's not that hard, and
then at some point you're definitely going to be
hitting harder problems.
Javier Luraschi: But the very nice thing is that to get started it takes
very little. And I think that's important 'cause if it take
a really long time to even do simple things or why not,
it's like no one has time for that. But if it gets you up
and running and you can do meaningful analysis very
easily, which is where we are today, it's very easy to
get us started. It's very easy to learn. And then as your
problems get harder and harder and you are
answering more interesting questions that are
interesting to you or that bring value to whoever
you're working for or with, it just gets easier to
keep working at it. And there's, I didn't mention it,
but there's a great community in R.
My guess is that some of the other R speakers here
mentioned it, but there's a very nice, warm community
around R and specifically also about sparklyr and why
not.
Javier Luraschi: So you can go to resources to community.rstudio.com
and just ask for help. You can also go to our GitHub
page and look at the issues and if something feels like
that you really need help with, you're going to open an
issue. So you're also not alone in this and there's a lot
of resources to get you up and running. So I definitely,
at least will encourage everyone to, if they're curious
about Big Data and Big Compute and Cluster
Computing, give it a shot, 'cause you're going to be
surprised. It's something that feels doable, which
would not have been the case a few years ago.
Kirill Eremenko: Fantastic. Thanks so much Javier for that description.
Hopefully that will encourage more people to jump into
learning Spark, especially through R. It sounds like
you guys really make it easy, and for those who are
already using sparklyr, Javier is making things
easier for you in this space. Well, there's
so many more things that I wanted to ask you about
Big Data, about your journey into this career and so
on. Unfortunately we're running out of time and
probably we only have time for one more thing. And I
thought out of all the things that I have written down,
I thought the most important one would be your book.
Kirill Eremenko: So you mentioned before the podcast you're working
on a book and seeing how much value you're giving on
this podcast, how passionate are you about the space.
I think it would be a shame for listeners to hear that
you and some of your colleagues or some of your
friends you're working on a book and that is going to
be published next year. So tell us a bit about that.
What is this book going to be about so people are
interested in this space can look forward to it.
Javier Luraschi: Right. So the name of the book is going to be
somewhere around the lines of, the R in Spark. Kind of
like funny name the R, what is the piece that we're
going to highlight from Spark or R, the R programming
language. So we already have a website and we're
going to put more information when the book is
published, but the website is, therinspark.com. So it's
pretty straightforward. For now it's a bit of a
placeholder, but if you're interested it's at least
good to keep in your bookmarks. My goal, both with
sparklyr and with the book, is to make it the
absolute easiest way to get started with Apache
Spark. So anything from, hey, what is this thing,
which we already ... Your listeners are lucky enough
to have you, so they already know what Spark is,
right, and they got a very nice introduction.
Javier Luraschi: So the goal of this book is to make it very, very easy for
anyone that opens the book to say, wait, [inaudible
00:55:37] is the Big Data? Oh, that's what it is. What
is spark? Oh, now I get it. Okay, now that I
understand it, how do I get started? It's going to be a
very gentle introduction. But being gentle, it's not
going to remove the fact that if you go through the
whole book we want to take you from being a very new
user to being close to being an expert. And like any
book, you need to do the exercise and practice and
why not. But we're definitely hoping that we can bring
a lot of people to the Apache Spark community as well
and just help them out. So yeah, definitely if you want
to keep in touch, therinspark.com might be the place.
Kirill Eremenko: Awesome. I'm actually looking at it right now and
there's for our listeners, if you're interested in what
Javier was talking about today and you found it
exciting to listen to R in Spark, if you go to the
website, therinspark.com, there's a way you can get
early access to the book. So you just have to send an
email to Javier and you'll get early access. I think that
would be pretty exciting to get early access to the
content. So yeah, if you guys are interested, jump on
top of that. Javier, we're out of time. Thank you so
much for coming on this show. Being an amazing
journey just listening. Totally, totally captivated too.
You have so much passion for this space. Before I let
you go, I have to ask, what are some of the best ways
for our listeners to get in touch with you to follow up?
You mentioned the R in Spark, what other ways can
our listeners get in touch with you?
Javier Luraschi: Yeah. I would say Twitter, but I'm honestly so bad at
Twitter. I need to listen to 10 tips to be a better Twitter user.
So Twitter, definitely you can find me there Javier
Luraschi and that's just Twitter, first name, last name.
I'll do my best to answer there. But definitely on the
GitHub page and we also have a Gitter channel. So it's
going to be pretty easy if you start looking at the
sparklyr, if you search for sparklyr, one way or
another one, you're probably going to end up reaching
myself and other colleagues in RStudio. So I would
just say just don't be afraid, whichever form you find
of contacting us, there's a Gitter channel where we can
chat. There is Twitter if hopefully I get better at using
it. There is GitHub, there is the book that will take us
a few months to get it to you. But just in general,
whichever way you can find us, just feel free to keep in
touch and I'm looking forward to that.
Kirill Eremenko: Awesome. And Linkedin, is it alright for listeners to
connect on Linkedin?
Javier Luraschi: Yeah. For sure. It's first name, last name altogether. I
don't think they have like a nice search in there, but I
my first name Javier, last name Luraschi. Twitter as
well. You can find me there. I don't know if I miss
anything else, but I can give you my address if you
want to. Just kidding.
Kirill Eremenko: And your social security number and where the money
is.
Javier Luraschi: That's for sure. Yeah, we'll put it out there.
Kirill Eremenko: Okay. No, I think that's all. Well, once again, thanks
so much Javier. Good luck with the book. Looking
forward to it coming out and I'm sure a lot of people
are going to get a lot from this podcast. Thanks so much.
Javier Luraschi: Well, Kirill thank you so much for having me and
grade work with this podcast. I'm really happy I had
the chance to be with you today, here.
Kirill Eremenko: So there you have it. That was Javier Luraschi from
RStudio. Hope you enjoyed this session as much as I
did personally. My biggest takeaway by far was the
fact that we dove into Apache Spark so
deeply and got to know this space so well from
firsthand, from a person who actually works in
developing a package to work with Apache Spark. So
Javier is up to date with all of the changes in Apache
Spark and he knows exactly everything that's
happening in this space. So it was really great to hear
this information, these insights directly from him. And
I'm sure you can also feel the immense passion that
Javier has for the space and in fact it's even
contagious, so I'm sure if you have never heard of
Apache Spark before, now you can feel that this is one
of those really powerful tools that maybe one day you'll
add to your Data Science toolkit.
Kirill Eremenko: On that note, you can get the show notes for this
episode at www.superdatascience.com/211. There you
will also find all the things that we mentioned in the
episode. All the materials we mentioned in the episode,
a URL to connect with Javier and follow him and his
career on all social media. You'll also find a link to the
upcoming book, to the website where you can register
to get some contents of Javier's upcoming book, which
is going to be pretty awesome based on what we heard
today. And of course, if you know anybody interested
in the space of Big Data, in Apache Spark who wants
to learn more, and who would be as excited about this
episode as you hopefully and of course as I was on
today's show, then please forward them this link. Help
spread the word, help other people learn these topics.
Apache Spark is a really cool tool that is helping data
scientists work with Big Data. So let's help each other
out. Send this episode to anybody who you think
might benefit from it. Whether it's a friend, colleague,
family member or somebody that you just know that
this will help them out.
Kirill Eremenko: On that note, thanks so much for being here today
and sharing this hour with us. Can't wait to see you
back here next time, and until then, happy analyzing.