SDS Podcast Episode 211 with Javier Luraschi · 2018-11-22
TRANSCRIPT
Kirill Eremenko: This is episode number 211 with software engineer at
RStudio Javier Luraschi.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name
is Kirill Eremenko, Data Science Coach and lifestyle
entrepreneur, and each week we bring you inspiring
people and ideas to help you build your successful
career in Data Science. Thanks for being here today
and now let's make the complex simple.
Kirill Eremenko: Welcome back to the SuperDataScience podcast.
Ladies and gentlemen, I'm very excited to have you on
the show today and today I've got a very interesting
guest Javier Luraschi on the show with us. Javier is a
Software Engineer at RStudio. If you use RStudio, you
might know a couple of packages that he's worked on
and/or co-authored, such as sparklyr,
mlflow, cloudml, and many more. And in fact, even if
you're just a beginner in R, you probably have already
encountered the dark theme in RStudio, and that's
something that Javier has also contributed towards. So
we've got a very exciting podcast coming up for you
just now with Javier. We talked a lot about Big Data
and Big Compute, and most of this podcast is
dedicated to Apache Spark.
Kirill Eremenko: So in this podcast you will find out the whole
history of Big Data: where Hadoop came from, why
Apache Spark was created, how they compare, what
Apache Spark is used for, how it's developed over
time, and how it's developing now. You will also learn
a lot about package development in RStudio and some
of the exciting things that are happening in this space.
And in addition to all that, you will feel a lot of
passion. Javier has a ton of passion for this space, for
RStudio, for Apache Spark, for developing packages,
for Big Data and Big Compute. So this podcast is full
of that. I was personally sitting on the edge of my seat
just enthralled by all these amazing stories that Javier
is telling and this whole space of Big Data.
Kirill Eremenko: So all in all, an incredibly exciting, captivating
podcast. Can't wait for you to check it out. So without
further ado, I bring to you Javier Luraschi, who is a
Software Engineer at RStudio.
Kirill Eremenko: Welcome, ladies and gentlemen, to this
SuperDataScience podcast. Today I have a very special
guest calling in all the way from Seattle, Javier
Luraschi from RStudio. Javier, how are you going
today?
Javier Luraschi: I'm doing good Kirill. How about you?
Kirill Eremenko: I'm doing well as well and I'm in Australia right now,
on the Gold Coast. The weather is quite warm here, unlike
Seattle. You said it's getting a bit cold there?
Javier Luraschi: Yeah, I'll probably have to go and visit you one of these
days 'cause definitely we can use some of your sun.
Kirill Eremenko: Yeah, for sure. And you mentioned winters are quite
harsh in Seattle, like how cold does it get?
Javier Luraschi: It's not that it gets that cold, it gets to about minus
five Celsius, which is not terrible, but we have very,
very long winters. So definitely expect a visit from me
somewhere around March or May. I'm assuming
Australia has pretty nice weather most of the year.
Kirill Eremenko: Yeah. Most of the year. It's quite good. The only thing
is that it's a bit like first time I got here I was not really
expecting that during winter it's warm during the day,
but then at night the temperature drops to like maybe
plus eight degrees Celsius, which is fine. But the thing
is they don't have central heating in the buildings. So
the building is actually cold and you have to get
blankets. So that was a bit unexpected. But other than
that it's really cool.
Javier Luraschi: I would say that at least in winter it's a time where you
can be very productive because you know, like it's
dark and everything and it's cold. So it's kind of like
puts you in more to just get things done which is good.
I've kind of been in warm countries and sometimes
getting work done in warm countries gets trickier.
Kirill Eremenko: I know. You work faster when it's colder.
Javier Luraschi: I believe that's for me, but I don't know if that's the
same for everyone.
Kirill Eremenko: Same for me. All right. Well Javier RStudio. You're on
the podcast. It's really cool to have you. As I
mentioned before the podcast, I've already spoken to
Nathan Stephens and Nisha Iyer. I've recently
just been chatting on Linkedin to Jonathan Regenstein
all from RStudio. You have such a fantastic team
there. This is incredible. Every person I met from
RStudio is like some sort of very interesting genius in
their own field and I learned so much. I'm really
looking forward to this podcast. I got really high
expectations for today.
Javier Luraschi: Well we appreciate that, but like it's definitely fun. I
don't know about all of us being geniuses, but definitely
it's fun. Especially since we're a distributed company, I
don't even know honestly where Jonathan is these
days, but whoever you talk with at RStudio it's likely that
they won't be in the same city 'cause we're just all
over the US and also Europe and yeah.
Kirill Eremenko: It's crazy. How is it to work in a distributed company
like that?
Javier Luraschi: I really love it honestly. I do feel like it's different. I feel
like any other new job or any change in your life
usually has a honeymoon period, right? Where you ...
Even Data Science, right? I bet you start with Data
Science and you're like, oh my God, this is great. This
is the best thing. For the most part. I think working in
a remote team has been that way. I do think, and I
would give this advice, or unsolicited advice, to your
listeners, right? There's definitely like a period where
you have to tweak your personal life to also
maximize all of the the things that are not work
related, if you know what I mean, right. So like when
when you're in the same office it's very easy to have
those personal interactions and keep up with people
and what not and when you're in a distributed team
you kind of lose the personal face to face connection
unless you proactively say, hey, I want to catch up with
these colleagues or friends that I haven't seen right?
Javier Luraschi: But net net, I think it's fantastic. I don't even know
what a commute looks like anymore, and it's nice to be
close to your family and what not, so I definitely highly
recommend positions in remote teams.
Kirill Eremenko: That's so cool. That's so cool. And I totally agree with
the whole concept: losing out on the personal side of
things is something you need to take care of. Like for
instance, at SuperDataScience, we have a remote team
and we all catch up, we make an effort to at least once
a year to all get into the same place and spend a week
together. And the other thing is like every week we
have a new buddy. So everybody in the team randomly
gets assigned somebody else so it could be like a
director, it could be an analyst, and for the
week, no, not for the week, for the month, you guys are
buddies and so you plan your weeks together, you
catch up once a week and that really puts people from
different parts of their company closer to each other.
So I think you're right, the personal side of things.
Javier Luraschi: Yeah. So I'm actually curious like what does that
mean? 'Cause you're still remote, right? Does that
mean that you get to talk with this person in Slack or
whatever you're using or?
Kirill Eremenko: So Slack, you can talk to anybody anytime, but that
means that at least once a week you need to catch up
and talk to that other person in person. So we use
Zoom, so not like in person as in-
Javier Luraschi: Yeah, we Zoom as well.
Kirill Eremenko: But like on video you need to catch up with that
person at least once a week at the start of the week
when you're planning your week and you spend an
hour together, and so you're already chatting about
the weekend, what are you doing in your personal life?
And feedback's been really good. People love to find out
more about that.
Javier Luraschi: Yeah connecting with other colleagues and what not.
That makes a lot of sense.
Kirill Eremenko: That's cool. So tell us, how did you get into RStudio in
the first place? Before we continue, I just wanted to let
everybody know. So Javier is a software engineer at
RStudio. Maybe let's start there. Before we go into the
story of how you got into RStudio, tell us a bit about
what exactly do you do at RStudio?
Javier Luraschi: Sure. Well, I mean, as you mentioned, I'm a software
engineer, so I write software, right? That's kind of like
obvious. But I mostly have been focusing on R
Packages. For those of you that might not be super
familiar with what an R Package is, it's basically R
code, which is just code in the R programming language, right?
And you package this code into modules that you can
share with other people, right? So this is actually, it's
a concept that you can find in other programming
languages, but I think what the R community has
going is pretty special. 'Cause there's a very nice
relationship between an R Package and an actual person
that lives and breathes in real life, right? So a lot of the
functionality that you use from R is going to come
from packages and those packages have maintainers
and authors. And at the end it's just R code, but it's
packaged in such a way that it makes it very easy for
you to reuse and for us maintainers to also keep up
with new releases and new features and why not.
Javier Luraschi: So yeah I do R Packages and specifically I do packages
that are mostly related with making R run faster or
at scale, or helping you share models with other people.
So R Packages can basically do anything you want and
the ones that I focus on are mostly about helping you
work with Big Data and Big Compute and specifically
with a package called sparklyr.
Kirill Eremenko: Interesting. That's a very interesting description. I
thought that R Packages mostly come from CRAN and
that they are created by the community. I didn't know
that RStudio purposely creates R packages. Can
you tell us the difference between those two please?
Javier Luraschi: Well, they're exactly the same. They're just two
different parts of the story, right. So basically an R
Package is literally like you could see it as a box
with code inside it, and CRAN is the store, all the
packages are free, right. But it's where you go and get
the actual packages. So there's people that build these
packages in the R community, they don't need to
work at RStudio. And honestly most of the packages
that are on CRAN are not from RStudio at all. So
anyone can package a useful piece of code into an R
Package and then make it available on CRAN. And
then anyone from the community can go to CRAN
search for packages and then install them, right. So
think about CRAN as the APP store, right? Whenever
you go to the APP store, you look through apps and
then you install them. CRAN is kind of like the APP
store from the R world. And some of the packages, very
few, are developed by RStudio. And many, many, many
others are developed by other people in the
community.
Kirill Eremenko: Okay. Gotcha. And in that case, how does RStudio
decide what the community is going to develop and
what is going to be developed within RStudio, like for
instance, within your team and just like within the
company? Is there kind of some strategy behind that?
Javier Luraschi: Yeah, honestly, I don't think
there's an explicit strategy, like this is my own
personal point of view. So I wouldn't speak completely
on behalf of RStudio. But the way that I see it is that if
we see someone that is already working in a package
and it's a great package in the community, we just
simply don't work on it, right? We're very pragmatic in
the way we try to approach problems. So we try to look
for the things that are the most painful and the most
impactful that the community might need. And if we
look at a problem, right? Like in my case, like, hey, Big
Data, right? Is someone in the community already
providing that? And if the answer is no and there's a
big enough impact that we can help the community,
we basically help.
Javier Luraschi: That's kind of the very high level, like, hey, this is kind
of how we see it. 'Cause yeah, I think it could be
possible that if the R community would take over all
the packages, there wouldn't be a need for RStudio
perhaps to develop packages, right. But something
that perhaps, I don't know if you already talked about
this on the series, but it is like the opportunity that
exists in Data Science in general, and in my particular
case, Big Data is so huge that, we need more people,
right. So it's not that we're fighting for packages which
is, I think it's a great place to be at. It's mostly like, oh
my God, Data Science is growing so much and there's
so much to be done that, let's just make sure that we
don't overlap because that's just inefficient. But
there's so much opportunity everywhere that we
haven't had that problem of hey, having to do a lot of
coordination or why not. It's mostly pretty obvious
when there's a gap or an opportunity that someone
can help with.
Kirill Eremenko: Gotcha. Okay. Very interesting role and lots of
packages. So maybe let's talk a little bit about some of
the recent development. So you mentioned Big Data.
Can you tell us a bit about Big Data, maybe even give
a quick overview for new listeners on the podcast who
are not familiar with the concept of Big Data. What is Big
Data and Big Compute? What's the story that's
happening there?
Javier Luraschi: For sure. Honestly, I really love those two concepts,
Big Data and Big Compute and mostly I really like
them 'cause there's people that have very strong
feelings one way or the other one and you probably
have seen these around. So kind of to me the way that
I like to introduce them, it's just from a historical point
of view. Maybe that's the most boring point of view,
but I think it's also pretty exciting, right? So if you
think, like to me one of the most exciting things about
in general Data Science and Big Data is, if you ask a
historian, right? Someone specializing in history,
hey, how do you divide human development,
right? The development of human civilization? It's
going to be pretty likely that they will mention things
like the stone age, pre industrial revolution, the
industrial revolution and why not.
Javier Luraschi: And a lot of them will mention also the information
age, right? I'm pretty sure that you might have already
heard this term, right? The information age and this
kind of concept basically means that there's a lot of
information that we're creating that is digital, right?
And if you look at one report from the World Bank,
which basically creates reports about how the world is
changing every year. One report that they had in 2016
kind of like tried to analyze how much physical data we
have that's called analog data in some ways, like books
and paper and anything that leaves a paper trail and
why not. And also digital information, right. And you
can see in the report it's just growing at exponential
rates. We don't have to explain this too
much, but if you look at how many cat videos we
create per day, right? Or Instagram photos or
Facebook photos or just data in general.
Javier Luraschi: We're creating a lot of information. So back in the year
2003, around that time, there were companies like
Google that were just starting, right? It's crazy, but
Google is a pretty new company, right? It wasn't here,
what is that, 20 years ago or why not? And what
Google was trying to do is say, hey, we have the
worldwide web and there's a lot of information that
seems to be usable. Can we make it searchable? And
there were obviously companies before them like
Yahoo and Excite and AltaVista for those of you that
remember, but Google was one of those companies
that said, how can we make search better?
Javier Luraschi: And the first problem that you hit when you're talking
about the web is, you can't really put the whole
web in one computer, right? It's just too much data and if
you try to put it in one computer, like it just doesn't
make any sense. So the way that they solved the
problem back then was by using multiple computers,
right? You have multiple computers, each with their own
hard disk, and across many computers you can load the
entire web and make searches, and that's basically
what gave birth to what's called HDFS, which I
believe has been maybe mentioned already in your
podcast. Which is basically a way of splitting data into
multiple machines. Right. So that's kind of like the
beginning of Big Data is like, hey, if you have
something that doesn't fit in one machine, you have
Big Data and it's a pretty clear definition. There's
obvious problems that fall into this category, right?
Like, hey I have my computer, I'm doing analyses, I
can't have all this data in my machine, I need multiple
computers. Well now we're talking about Big Data.
Kirill Eremenko: And HDFS is the Hadoop Distributed File System?
Javier Luraschi: Yeah, it's basically the Hadoop Distributed File
System. Basically the way that this started is Google
had a research paper where they explained to the
world, hey, we have the Google file system, right? And
we have a bunch of files, this is how we distribute
them, and then some engineers in India said,
well, this sounds like a great idea, can we make it an
open source project? And that's where Hadoop came
from. It's this open source implementation of the
internal file system that Google made available through a
research publication, right. And let me know how this-
Kirill Eremenko: Sounds great [crosstalk 00:18:44].
Javier Luraschi: Yes, it's great okay. I want to make sure that I give
enough information but not too much.
Kirill Eremenko: No. I love it. You're totally right. The historic
context makes it even more exciting. Makes it like a
story, like a journey. So yeah, no, I'm just listening.
I'm really immersed in your explanation, please
continue.
Javier Luraschi: Well, yeah for sure. So far so good, right? Google was
doing their own thing and we had open source projects
like Hadoop that were doing their own thing, and as I
mentioned, they were mostly based on disk drives
right? Disk drives even today can store the most
information for the lowest cost, so it makes sense to
put information there. But then a different project
came out around, I want to make sure I get this right,
but I believe it was around 2009 there was a project in
Berkeley to kind of improve over Hadoop. They said, well,
Hadoop is great, we're doing all these things based on a
file system based on disk drives, but can we do it
better? Can we do it faster? And sure enough, another
open source project started in Berkeley called Apache
Spark. And the premise at that point in time, or
one of the things that changed from its precursor,
was that there was a trend in which RAM,
computer memory, was getting cheaper.
Javier Luraschi: Not as cheap as disk drives, but it was getting a bit
cheaper, right. And it also happens to be the case that
your computer memory makes your computer pretty
fast, right? So whenever you go and buy a new
computer, the amount of memory is one of the things
that you want to go and check, oh, it has like four
gigabytes or eight gigabytes or 32 or whatever. It
makes a big difference on speed 'cause it means that a
computer can handle more things in a faster and
easier way, right. Anyways, so basically
the Apache Spark project started by figuring out how
to create something like Hadoop and improve over it
based on memory.
Javier Luraschi: And what they found out is that sure enough you can
get significant speedups by using memory instead of
disks. And one of the things that we like to do as
software engineers is sort data, sounds like such an
easy task, just to order data. If you had a list of names
of people and you just wanted to list them in alphabetical
order, it happens to be a pretty compute intensive
task, especially when you have a lot of data. Anyway,
so what they found out back then is what it used to
take to sort 100 terabytes, which is a lot of data: it
used to take Hadoop 72 minutes and 2100 computers.
So, that was with Hadoop. And then with this new
framework based on memory, and with a richer
vocabulary of operations that you could do, you
could do the same in 23 minutes but only using 200
computers.
Javier Luraschi: So that's crazy, right? It's an improvement of 10X
performance, you need 10 times fewer computers and
you actually make it ... You can sort the information
faster. So it was a really, really big deal back then. And
in some ways it's still part of the main trend of trying
to figure out as a society and as humanity, how do we
order, how do we make sense of all this information,
right? How do we arrange it? How do we store it? How
do we answer harder and more interesting questions
over and over? So, that's pretty much what Spark is.
It's an in-memory engine that allows you to process
any information. One of the things that is different
perhaps with Hadoop is that with
Hadoop, there was only support for one type of
operation, called MapReduce. And maybe this is also
something that people have heard around when
they're getting immersed into Data Science and Big
Data, right? It's like MapReduce.
Javier Luraschi: Well, back when Hadoop was designed, there was
another paper from Google where they proposed this
model. Whenever you want to transform information, I
guess I should say that so far we've talked about
storing information and retrieving it. We haven't talked
enough about, okay, well, information is across multiple
computers, how do you actually make it do something
meaningful if it's distributed across multiple
computers, right. So the first attempt to solve that,
which also was at around the time a couple of years
after Hadoop came to be, was another paper from
Google called MapReduce, which basically
explained to the world, hey, if you have a distributed
system, that basically means you have a lot of
computers, you can reduce all the operations to
mapping, meaning transforming some information in
the same machine to some other information in the
same machine, and then combining the information
between machines.
Javier Luraschi: That's kind of like at a high level what it is. And it was
pretty good at that time. But it was also constraining
in the sense that it didn't provide a lot of verbs or kind
of like grammatical constructs to make coding
easier, right? It was pretty bare bones at the time. And
one of the other big improvements from Spark is that it
enabled a big vocabulary of operations to make Big Data
easier to analyze, right? So instead of just saying,
hey, you need to tell me how to map data on each
machine on then aggregate it, it introduced things like,
hey, just tell me what you want filtered, what you
want averaged. Kind of like things
that now feel closer to data analysis.
Like, hey, I just want to count how many earthquakes
are available in these data sets. That's a pretty simple
question that used to be actually pretty tricky to ask
back when we only had MapReduce.
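As an aside, the map-then-combine model described above can be sketched on a single machine as an analogy. The snippet below is a hypothetical word count in base R, not actual Hadoop code: the map phase turns each document into (word, 1) pairs, and the reduce phase sums those pairs across "machines".

```r
# Hypothetical single-machine analogy for MapReduce (not actual Hadoop code).
docs <- c("spark is fast", "hadoop is on disk", "spark is in memory")

# Map phase: each "machine" turns its document into (word, 1) pairs.
mapped <- lapply(docs, function(doc) {
  words <- strsplit(doc, " ")[[1]]
  setNames(rep(1, length(words)), words)
})

# Reduce phase: combine pairs across "machines" by summing counts per word.
pairs <- unlist(mapped)
counts <- tapply(pairs, names(pairs), sum)
counts["is"]  # "is" appears once in each of the three documents, so 3
```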
Javier Luraschi: So with Spark you can say things like, hey, from all
these data set, give me all the earthquakes that just
happened in Australia on this period and give me the
count. So that's a much richer way of expressing
and analyzing data when it's distributed across
multiple machines that previous technologies didn't
really ... It wasn't as simple. You had to provide code
to process each disk drive on each machine
and then figure out how to aggregate and all that. So
yeah it was a big deal and it's still a big deal. I mean, it
has been less than 10 years. So if you look at the
Apache Spark project, I mean, we can talk about more
of the things that have happened and why not. It's still
changing a lot. So it's by no means what I would
consider it done. And it's growing everyday.
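The earthquake example might look roughly like this with sparklyr and dplyr; the quakes data set and its columns are made up for illustration, and `master = "local"` assumes a local test installation of Spark.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# A made-up earthquake data set copied into Spark for illustration.
quakes_df <- data.frame(
  country   = c("Australia", "Japan", "Australia", "Chile"),
  magnitude = c(4.1, 5.9, 3.2, 6.4)
)
quakes_tbl <- copy_to(sc, quakes_df, "quakes")

# "Give me all the earthquakes in Australia and give me the count."
quakes_tbl %>%
  filter(country == "Australia") %>%
  count()

spark_disconnect(sc)
```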
Kirill Eremenko: Gotcha. So if I understand correctly, Apache Spark not
only is faster than Hadoop in the sense that it works
in memory, uses RAM instead of disk, but it
also is simpler to use and operate. Is that correct?
Javier Luraschi: Yeah. That is completely correct. And the thing to say
here, it sounds really easy and it is really easy
compared to dealing with MapReduce. But you're still
talking about hundreds of machines or at least tens of
machines. So it's a pretty hard problem and it's
surprising the amount of progress that the open
source community has made in the past 20 years.
Where now really anyone, if we want to get into the
specifics of, hey, how do I get into this? Really anyone
can download the tools that they need in less than 15
minutes and get up and running and start doing data
processing. That was really state of the art in
companies like Google or Yahoo like 15 years ago. So
that is just kind of like the technical explanation
of why Data Science and Big Data are such a big deal.
Javier Luraschi: It is, 'cause these tools are so much easier today than
they were before that we honestly haven't figured out
yet how to fully apply them and we can maybe talk a
little bit about this. But just talking about potential.
It's like, well Google and Yahoo and other big
companies were using these tools and they were great
and now they just became so easy to use. What can we
do with them? And I'm sure a lot of your listeners will
have more particular ideas, but it's definitely
fascinating.
Kirill Eremenko: Yeah for sure. And do you have certain packages in R
which allow people to work with Apache Spark?
Javier Luraschi: Yeah, that's a great question. And the answer is yes.
And the package that I recommend using, it's called
sparklyr. It's kind of like a corny name made out of Spark:
you have Spark, and this is sparklyr, or sparkly R.
Yeah, so we try to name our packages in fun ways
'cause it doesn't make sense to name them with boring
topics. So yeah, it's called sparklyr and it's basically
the way that you use Spark from the R programming
language, right. So R is obviously a pretty well known
programming language or computing language I would
want to say-
Kirill Eremenko: Very well known. My brother is studying at uni now
and he's actually using R. It's really going well.
Javier Luraschi: There you go. So if you're already learning R, it's
a natural progression to say well,
what about things like Big Data? And we haven't even
talked about Big Compute so we need to get back to
that. But yeah, definitely if you want to get involved
into doing things in cluster computing with Big Data
and why not, this R Package called sparklyr is a very
nice way of getting started. And again in this
particular case, it's a package that RStudio developed,
myself and other people at RStudio are authors of
this package. But it's actually an Apache licensed
open source project. So it's completely free. It's
available in CRAN, you can download it, it's easy to
install. And same for Spark. I didn't mention that, but
Spark also happens to be an open source project, it's
Apache licensed. So it's pretty much there for anyone
to use that has that need or interest.
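Getting started really is close to the "15 minutes" mentioned above; a minimal sketch, assuming a local machine with Java available for Spark:

```r
install.packages("sparklyr")    # sparklyr itself comes from CRAN
library(sparklyr)

spark_install()                 # downloads a local Spark distribution
sc <- spark_connect(master = "local")  # connect to that local Spark

spark_version(sc)               # check which Spark version is running
spark_disconnect(sc)
```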
Kirill Eremenko: That's very cool. That's very cool. How long does it take
you guys to do all the packages? Just out of curiosity?
Javier Luraschi: Well in all honesty, I feel like we're still working on it
and we're going to be working on it for a while. So
the original package, we worked for a few months on it
back in 2016. We actually announced it at useR!
2016. It had the basic functionality which
included things like being able to do data analysis with
dplyr, which is one of the most used packages in the
R community and allows you to basically express
data manipulation operations through a grammar.
That makes it very easy to analyze everything from
very simple things like, hey, give me the average of this
column, we all used to do that in Excel or why not, or
counting records, all the way to, well, I want to
join these two data sets and I want to
run a specific computation on them and why not. So it
definitely has that breadth of functionality for very
basic data analyses to very complex data analysis.
Javier Luraschi: So that was one of the first features that we decided to
include in sparklyr, to allow the community to easily
move their already existing analyses, basically analyses
that they can run on their own machine. Like if you
have a CSV file, just a text file or why not, or an Excel
file ... By the way, you can use Excel with R if that's
what you like doing. So what dplyr allows you to do is
to say hey, I'm going to use an Excel spreadsheet to do
analysis, so you can import it with readr and then you
can do data analysis with dplyr. But now, with
sparklyr and its support for dplyr, you
can say, well, instead of running this same analysis on
this Excel spreadsheet file, now I want to run it on like
10 terabytes of data or why not?
Javier Luraschi: It's the same thing. Which is kind of crazy. You don't
need to rewrite it. You don't need to worry about a lot
of those things. You do need to pay for the computers, right?
We haven't talked about this but like still computing is
not free, right? So someone needs to pay for those
computers and what not. But at least as a user and if
you work in a company or you aspire to work in an
organization that is going to do data analysis at
small scale and also at big scale, these tools allow you
to really easily jump from, hey, I just want to do
something quick and dirty while I'm on the bus, on my
way to work, and it works, it's great. And then like you
can just literally copy paste the same code and run it
against like a ginormous data set.
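That "copy paste the same code" point can be illustrated with R's built-in mtcars data set: the same dplyr pipeline runs locally and, unchanged, against a Spark table (a sketch, assuming a local Spark install).

```r
library(sparklyr)
library(dplyr)

# Locally, against an in-memory data frame:
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# The same pipeline against Spark: copy the data in and reuse the code.
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))  # executed by Spark, not by R

spark_disconnect(sc)
```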
Javier Luraschi: Yeah, so that's kind of like how it started back in
2016. But since then, what we've seen is that the
Spark community keeps growing. They keep adding
new features-
Kirill Eremenko: They have to keep up.
Javier Luraschi: It keeps getting more and more interesting. Like for
instance, one of their conferences got renamed last
year I believe, from Spark Summit to Spark + AI
Summit. So like now Spark is getting also intertwined
with deep learning and it's enabling a bunch of other
really cool interesting things to do. So I honestly don't
think that we're going to be able to call our work done,
at least not in the next few months, right. I don't know
if it's one year, two years or another 10 years, who
knows? But definitely it's like a moving target. And I
think that's exciting, right?
Javier Luraschi: 'Cause that means that you're not jumping into
something that started like 10 years ago that no one
uses anymore, right? It's like the opposite,
right? There's things that are very stable that you can
rely on and use at scale, and we are
very confident that they're going to give you the
productivity results that you're looking for. And there
are other things that are just very, very new.
And they're exciting and they're probably going to get
there, but they're still like a moving target. So there's
definitely like interesting bits and pieces for
newcomers and for the experts to be excited about,
specifically about Spark just 'cause we're talking about
Apache Spark, but also I think in general in Data
Science.
Kirill Eremenko: Gotcha. And while we're on the topic of Spark, I know we
have so many other things we can discuss, but I'm
really curious, Apache Spark 2.0 was released I think
last year, at the beginning of last year. What are your
thoughts on that jump from Spark to Spark 2.0, what
new stuff was added?
Javier Luraschi: Yeah. For sure. Well first of all I would say that that
was a big jump. Just to give some context, Spark 1.5
was when the first wave of people started getting
familiar with it, and I feel like it really hit mainstream.
1.6 is still one of the most widely used versions
today, and the jump from 1.6 to 2.0 introduced many
improvements and interesting new features. For
instance, one of the ones that I was excited about
was Spark Structured Streaming. I know it's a pretty
long name, but it basically means real-time data
processing. And I feel like that's a really good
segue to talk a little bit about Big Compute, right?
'cause we've only talked about the Big Data part of
the story, which is important and relevant. But then a
lot of companies don't have huge data sets, right?
Javier Luraschi: So what's the point of Big Data? Why do I need Spark?
Is this just hype and whatever? And you know like
some of that is true, right? Not everyone needs to Big
Data, but there's this whole other side of the coin
called Big Compute. And what Big Compute the way
that I like to explain it is just mostly making things
faster, right? So like when you have simple questions
like, hey, count how many records I have. Well that's
usually pretty fast and you don't have to worry about
that, right? But when you start asking questions
like, hey, could you please sort this whole data set?
Well, that's a little bit harder, right? And then for
those people that are on the track of doing Data
Science, and again, I'm not an expert on Data
Science, I'm a software engineer, but they will get
more and more familiar with more complex models, right?
Javier Luraschi: So, some people might already be familiar with linear
regressions, which are a type of model that is pretty
common and pretty efficient and like a pretty good first
step towards modeling. But then you can
incrementally get harder and harder models. You want
to really fit the data correctly and why not? So a lot of
times what happens is like, well maybe I only have 100
megabyte data sets. Like well, that's obviously not Big
Data, right? But then you're running these models and
a lot of times what we see is we see data scientists just
waiting like an hour, right? Or it's like, well, I'm going
to go for a coffee 'cause this thing is just running for
the next two hours. And it's like, well, you know, that
might be a good case for Big Compute, right? Which
means, it only really means saying, hey, let's divide
these tasks into multiple computers 'cause even
though you don't have Big Data, wouldn't it be nice if
instead of waiting two hours we're going to give you
the answer in like 20 seconds, right? And he's like,
well-
Kirill Eremenko: Then there is no time for coffee.
Javier Luraschi: Okay, don't tell about your boss, but you can do it
faster and still go for coffee, but anyways. So definitely
there's the other side of the coin where you say like,
well, I don't care about Big Data or I don't have a need
for Big Data, but I want to make things faster. And
when you get to that point, it's like, well, how fast is
fast enough? I mean, you know, for you and me, maybe
our data sets aren't time-sensitive and we're like,
well, if you give me the answer in like 10 seconds,
that's good enough. Who cares? That's fine. But
there's a lot of industries out there, like I'm just
thinking of stock trading, right? I mean if I tell my boss, hey,
you know, I'm going to give you the answer in 10
seconds. He's like what are you doing?
Javier Luraschi: That's really not going to help me. So there's a lot of
use cases where you want to have instant feedback,
right? And some of the ways that we describe this is
with concepts like real time. We say oh, we want to
process data in real time, right? Meaning I don't want
to wait for whatever is being processed to finish
processing. I need the data right now. And there is
definitely a niche there of who really needs real time
versus who doesn't. But Spark structured streaming
enables those types of really fast execution models
that are very useful in some cases, that make a lot of
sense. And the way that they are tackling it, and
that we're tackling it with sparklyr, specifically
streaming, is a very profound, interesting way. And
the credit goes to the Spark project for the way that
they define a stream.
Javier Luraschi: So we're talking about Stream and we haven't even
really defined like, hey, what is Stream? Right? Well,
the way that we define what a Stream is in Spark is
like a table, like Excel spreadsheet, but when you open
Excel spreadsheet, you have a limited amount of rows,
right? Like you open it and you have, hey, I have like
20 rows or whatever two million. But you have like a
set number of rows. The difference with streams is
that we consider them us data sets that have an
infinite amount of rows, right? It's not true that they
have an infinite amount. But if you're looking at the
stock market and you were seeing like what's the price
of the Nasdaq every second. And if you try to see that
as a table, well, it's a table, but assuming that the
Nasdaq doesn't crash and disappear from planet
earth, it looks like an infinite data set, right? Of
data that just keeps coming.
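To make the "infinite table" idea concrete, here is a minimal sketch of handling a stream from R with sparklyr's stream_* helpers. The directory names are placeholders, and exact reader options may vary by sparklyr version; this assumes Spark is installed locally and a connection can be made.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark was
# installed, e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Treat the "prices-in" folder as an unbounded table:
# every CSV file that lands there becomes new rows of
# the same stream.
prices <- stream_read_csv(sc, "prices-in")

# Ordinary dplyr verbs apply to the stream as it grows.
filtered <- prices %>%
  filter(price > 0)

# Continuously write results out as new data arrives.
stream_write_csv(filtered, "prices-out")
```

The point of the sketch is that the stream is queried with the same verbs as a static table; Spark handles the incremental execution underneath.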
Javier Luraschi: And it also have the quality that you want to process it
really, really fast. You don't want to wait of like, yeah,
I'll tell you what is a good prediction for the stock
market tomorrow. Right? It's like, well, I need it
instantly. So kind of like when you start looking at
those data sets, that data is coming really fast and it
never stops coming, structured streaming is like the
future that you kind of want to consider. So, that's one
of the newer features that I'm excited about. Doesn't
mean by any means that you need to get started with
Data Science and Big Data with streaming, right?
There's a million other ways to get started, but it's
definitely one of those features that it's pretty exciting.
I could talk here for hours, so just tell me where
you want to steer the conversation.
Kirill Eremenko: I'm glad we touched on Big Compute and that it's a
part of Spark 2.0. So Big Compute, as I understand it
in this sense, is the ... Or the predecessor of Big
Compute is just parallelized computing. That's
something I studied back at uni-
Javier Luraschi: Yeah it is-
Kirill Eremenko: About paralyzed processes.
Javier Luraschi: Yeah it is.
Kirill Eremenko: Okay. Gotcha, Gotcha. And Spark takes full advantage
and that sense that it's running in memory, right? Big
Compute happens in memory and not on disk drives.
Javier Luraschi: And we were talking about these verbs that exist on
Spark, that make it really easy to do operations, right?
Like, hey, I want to filter, I want to get the average, I
want to join these two data sets, right. Since you are
already familiar with parallel computing, you probably
also remember how hard it is, right? Or it used to be,
right? It's not trivial at all to say, well I have three
computers with small data set, now I want to do
calculation over these three things at the same time.
And it's like, well, it's actually pretty tricky.
Javier Luraschi: I wouldn't say necessarily that it's fully sold on Spark
'cause there's a lot of like, you know, if you're doing
genomic analyses are why not, right. You will probably
have to do customs things. But just talking specifically
about data analysis, that problem is well solved in
Spark. You want to have to worry about, hey, how do I
make these things run faster? You just explain it in
terms similar to what we were talking about in R,
where we have the dplyr package. You can arrange a
pretty big data set, like this hundred terabyte data
set that we were talking about. You can still sort
that data set in 23 minutes just by saying sort: like
sort, open parentheses, close parentheses, with a
pipe before that, or what not.
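As a small illustration of those verbs, here is a sketch using dplyr on an ordinary data frame; with sparklyr, essentially the same pipeline runs against a Spark table (for example one created with copy_to), which is what makes the scaling transparent. The column names and values are made up for the example.

```r
library(dplyr)

# A toy data set standing in for something much larger.
flights <- data.frame(
  airline = c("AA", "UA", "AA", "DL"),
  delay   = c(12, 45, 3, 27)
)

# filter, summarize, and the "sort" (arrange) mentioned
# above, chained together with the pipe:
result <- flights %>%
  filter(delay > 5) %>%
  group_by(airline) %>%
  summarise(avg_delay = mean(delay)) %>%
  arrange(desc(avg_delay))

# result puts UA (45) first, then DL (27), then AA (12).
```

On a Spark connection, dplyr translates this pipeline to Spark SQL instead of executing it locally, so the same code scales to the cluster.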
Javier Luraschi: So, kind of like those things get ... Yeah and you're
right. Big Compute is not a new term. It's just
something that has been getting simpler and simpler.
And hopefully with time it will get even simpler. I
don't even know what that would look like 'cause I
already think it's pretty simple. Honestly, we could
also talk about that, 'cause I feel like one of the
challenges today in distributed computing is
troubleshooting when things are running at scale. In
a lot of cases
things will run smoothly and you can run your
computation and be done with it. The reality today is
that, the harder computations you're doing, even
though it's easier to express what to do, there is still a
lot of like, hey, I need to troubleshoot. Why is this
machine failing? Does it have enough memory? My
hope is that in the future those things will get even
more and more automated.
Javier Luraschi: I'm just making stuff up at this point. This is almost
science fiction. But like it would be cool that before
you execute some data analysis and you say like, hey,
I want to sort this data set. It would be cool if the
tool, in this case Spark or sparklyr or what not,
would tell you, hey, if you want to run it on one
computer, it's
going to take you three hours. If you spin up 10
computers with eight gigabytes of RAM each or
whatever, it's going to take this much.
Javier Luraschi: That part today doesn't exist yet. You need to figure
out like how big the cluster needs to be or how small
or get some advice from your system
administrator if you work in a big company or a big
organization. But hopefully one cool thing that we
could do is just make it easier to say hey, we're going
to help you optimize what you need and we're going to
tell you. And then you can execute or why not. But
yeah, there's definitely like a lot of really interesting
work if other ... If a lot of people listening to your
podcast feel more on the software engineering track, I
would also encourage them to explore kind of like those
areas.
Javier Luraschi: A lot of the questions that we get or I personally get
today to answer is I'm a software engineer, how do I
become a data scientist? And that's totally fine for
those people that want to become data scientists. But
there's also a lot of disciplines around data scientists
where people can apply their skills-
Kirill Eremenko: Like you. You're a software engineer working on Data
Science stuff all the time.
Javier Luraschi: Right. And I love software engineering and I wouldn't
change it, but it's surprising that I still find a lot of
very meaningful problems and interesting challenges
on Data Science without being per se a data scientists,
right?
Kirill Eremenko: Yeah. Gotcha. That's definitely a very interesting-
Javier Luraschi: Point of view I guess.
Kirill Eremenko: Yeah and career path that you've decided for yourself.
Javier Luraschi: I mean honestly I haven't thought about other career
paths, but I could see how someone that is doing
marketing, or what not, could also take that focus. I
don't know exactly what that would look like. But you
could say, you know what, I like marketing, but I
want to complement it with Data Science: how do I
really apply Big Data and Data Science to marketing?
So for those people that are curiously looking at
Data Science, I feel like there are also strong
career paths if you stay with whatever you're doing
but put a strong focus on Data Science or Big Data or
Big Compute. You stay where you are, and that's
probably just as valid.
Kirill Eremenko: Very, very true. Before we move away from Spark, I
just wanted to ask you quickly from your experience
and from what you've seen in this space, how difficult
is it to learn Spark? You mentioned that it's quite
simple to use, as in it parallelizes a lot of stuff for you.
You don't even have to think much about it, but in
general, what would you say, how long is the learning
curve from knowing some R programming and how to
do Data Science on a basic level in R to actually
knowing how to use Spark and querying Big Datasets
with Spark and with R. How long do you think that
would take?
Javier Luraschi: Well, so I would split this into two questions, right? So
if you're starting by your own and you don't have like
even a Spark cluster. If you're literally like finished like
if you're on your own, I feel like that could be
challenging 'cause it's like, well where do you even get
the computer? There's a lot of questions to ask and
how do you-
Kirill Eremenko: Amazon?
Javier Luraschi: Yeah, Amazon. So there's a lot of services out there.
There's Amazon, which has a service called EMR.
There's a service called Azure HDInsight, there's
Cloudera, there's Databricks. There's like 10 different
ones and I apologize for the ones that I didn't mention.
So there's definitely ways of getting started if you're on
your own. But I think that's usually not the case.
Usually what happens is, well there's ... We can break
it down again into two. One is like if you're learning, if
you're like, hey, I want to learn about R and I want to
learn about Spark. Well I have very good news for you
'cause it's actually super, super easy. In fact that's one
of the goals that I have that brings me the most joy is
to make it absolutely insanely easy for you to get
started with Spark and R. So you can get started very
easily. You download the sparklyr package by
installing it like any other R package, and then you
literally run spark_install, open parentheses, close
parentheses. It will install Spark for you, and then
you run spark_connect.
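Spelled out, those getting-started steps look roughly like this; sparklyr's function names are spark_install() and spark_connect(), and the first run downloads Spark, so it can take a few minutes.

```r
# Install sparklyr like any other R package, then let it
# install a local copy of Apache Spark for you.
install.packages("sparklyr")
library(sparklyr)

spark_install()                        # downloads and sets up Spark locally
sc <- spark_connect(master = "local")  # connect to the local instance

# Copy a small built-in data set into Spark and query it
# to confirm everything works.
cars_tbl <- dplyr::copy_to(sc, mtcars)
print(dplyr::count(cars_tbl))

spark_disconnect(sc)
```

The local master is only for learning and experimenting; the same code works against a real cluster by changing the connection string.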
Javier Luraschi: Don't try to do these from the actual podcast. But if
you want more information, go to spark.rstudio.com
and we'll help you get there. So it's totally real easy.
So, that's one case. If you're a student, definitely you
can do it and the barrier or the cost to enter is super,
super low. So give it a shot. The other way where you
might end up working with Apache Spark, which is
also really easy, is if you happen to end up working at
an organization that already has an Apache Spark
cluster, it is often the case that there's a cluster
administrator and there's someone whose job it is to
maintain that cluster or someone that is already
paying the bill with Databricks or Amazon or Google or
Azure, or what not, right?
Javier Luraschi: So if those clusters are already up and running and a
lot of times the data is already there, it also happens
to be very, very simple 'cause you don't have the
burden of setting that up, right? All you need to ask is,
hey, where's the cluster? What is the connection
name? If you listen to ... I'm almost sure that
Nathan probably talked about connection strings,
right? All you need is a character string that tells
you where the cluster is. And so you basically pass
that into spark_connect and that will get you up and
running.
And cluster administrators are pretty good at helping
other employees or members of that community to get
up and running. So I think in both cases it's pretty
easy to get started.
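For that second case, connecting to an existing cluster, the connection string is the only thing that changes. The URLs below are placeholders for whatever your cluster administrator gives you.

```r
library(sparklyr)

# Standalone Spark cluster, identified by a spark://
# connection string:
sc <- spark_connect(master = "spark://cluster-host:7077")

# Or, on a YARN-managed Hadoop cluster, the master is
# simply "yarn":
# sc <- spark_connect(master = "yarn")
```

From there, the analysis code is identical to the local case; only the master argument points at shared infrastructure.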
Javier Luraschi: So those are I think just great news. There's almost no
excuse to not try it out. I would say that it is a linear
... Well, I would like to say that it's a linear learning
path. So it is not true that doing everything on Spark
is as easy, if you know what I mean. As we were
mentioning, there are things that are very easy, and
newer things, like maybe Spark Structured Streaming
or other topics, that I work on practically every day
to make easier. But you want to start with small
steps which will get you very, very far regardless. But
then as you feel confident and feel proficient, you just
need to keep walking, right? It's just like a slope
with a [inaudible 00:51:38] climb. Kind of like
mountain climbing or what not, that type of deal
where, well, you start going, it's not that hard, and
then at some point you're definitely going to be
hitting harder problems.
Javier Luraschi: But the very nice thing is that to get started it takes
very little. And I think that's important 'cause if it take
a really long time to even do simple things or why not,
it's like no one has time for that. But if it gets you up
and running and you can do meaningful analysis very
easily, which is where we are today, it's very easy to
get us started. It's very easy to learn. And then as your
problems get harder and harder and you are
answering more interesting questions that are
interesting to you or that bring value to whoever
you're working for or with, it just gets easier to
keep working at it. And there's, I didn't mention it,
but there's a great community in R.
My guess is that some of the other R speakers here
mentioned it, but there's a very nice, warm community
around R and specifically also about sparklyr and why
not.
Javier Luraschi: So you can go to resources to community.rstudio.com
and just ask for help. You can also go to our GitHub
page and look at the issues and if something feels like
that you really need help with, you're going to open an
issue. So you're also not alone in this and there's a lot
of resources to get you up and running. So I definitely,
at least will encourage everyone to, if they're curious
about Big Data and Big Compute and Cluster
Computing, give it a shot, 'cause you're going to be
surprised. It's something that feels doable, which
would not have been the case a few years ago.
Kirill Eremenko: Fantastic. Thanks so much Javier for that description.
Hopefully that will encourage more people to jump into
learning Spark, especially through R. It sounds like
you guys really make it easy, and for those who are
already using sparklyr, Javier is making things
easier for you in this space. Well, there's
so many more things that I wanted to ask you about
Big Data, about your journey into this career and so
on. Unfortunately we're running out of time and
probably we only have time for one more thing. And I
thought out of all the things that I have written down,
I thought the most important one would be your book.
Kirill Eremenko: So you mentioned before the podcast you're working
on a book and seeing how much value you're giving on
this podcast, how passionate are you about the space.
I think it would be a shame for listeners to hear that
you and some of your colleagues or some of your
friends you're working on a book and that is going to
be published next year. So tell us a bit about that.
What is this book going to be about so people are
interested in this space can look forward to it.
Javier Luraschi: Right. So the name of the book is going to be
somewhere around the lines of, the R in Spark. Kind of
like funny name the R, what is the piece that we're
going to highlight from Spark or R, the R programming
language. So we already have a website and we're
going to put more information when the book is
published, but the website is, therinspark.com. So it's
pretty straightforward. For now it's a bit of a
placeholder, but if you're interested it's at least
good to keep in your bookmarks. My goal, both with
sparklyr and with the book, is to make it the
absolute easiest way to get started with Apache
Spark. So anything from, hey, what is this thing,
which we already ... Your listeners are lucky enough
to have you, so they already know what Spark is,
right, and they got a very nice introduction.
Javier Luraschi: So the goal of this book is to make it very, very easy for
anyone that opens the book to say, wait, [inaudible
00:55:37] is the Big Data? Oh, that's what it is. What
is spark? Oh, now I get it. Okay, now that I
understand it, how do I get started? It's going to be a
very gentle introduction. But being gentle, it's not
going to remove the fact that if you go through the
whole book we want to take you from being a very new
user to being close to being an expert. And like any
book, you need to do the exercise and practice and
why not. But we're definitely hoping that we can bring
a lot of people to the Apache Spark community as well
and just help them out. So yeah, definitely if you want
to keep in touch, therinspark.com might be the place.
Kirill Eremenko: Awesome. I'm actually looking at it right now and
there's for our listeners, if you're interested in what
Javier was talking about today and you found it
exciting to listen to R in Spark, if you go to the
website, therinspark.com, there's a way you can get
early access to the book. So you just have to send an
email to Javier and you'll get early access. I think that
would be pretty exciting to get early access to the
content. So yeah, if you guys are interested, jump on
top of that. Javier, we're out of time. Thank you so
much for coming on this show. Being an amazing
journey just listening. Totally, totally captivated too.
You have so much passion for this space. Before I let
you go, I have to ask, what are some of the best ways
for our listeners to get in touch with you to follow up?
You mentioned the R in Spark, what other ways can
our listeners get in touch with you?
Javier Luraschi: Yeah. I would say Twitter, but I'm honestly so bad at
Twitter. I need to listen to 10 tips to be a better Twitter user.
So Twitter, definitely you can find me there Javier
Luraschi and that's just Twitter, first name, last name.
I'll do my best to answer there. But definitely on the
GitHub page and we also have a Gitter channel. So it's
going to be pretty easy if you start looking at the
sparklyr, if you search for sparklyr, one way or
another one, you're probably going to end up reaching
myself and other colleagues in RStudio. So I would
just say just don't be afraid, whichever form you find
of contacting us, there's a Gitter channel where we can
chat. There is Twitter if hopefully I get better at using
it. There is GitHub, there is the book that will take us
a few months to get it to you. But just in general,
whichever way you can find us, just feel free to keep in
touch and I'm looking forward to that.
Kirill Eremenko: Awesome. And Linkedin, is it alright for listeners to
connect on Linkedin?
Javier Luraschi: Yeah. For sure. It's first name, last name altogether. I
don't think they have like a nice search in there, but I
my first name Javier, last name Luraschi. Twitter as
well. You can find me there. I don't know if I miss
anything else, but I can give you my address if you
want to. Just kidding.
Kirill Eremenko: And your social security number and where the money
is.
Javier Luraschi: That's for sure. Yeah, we'll put it out there.
Kirill Eremenko: Okay. No, I think that's all. Well, once again, thanks
so much Javier. Good luck with the book. Looking
forward to it coming out and I'm sure a lot of people
are going to get a lot from this podcast. Thanks so much.
Javier Luraschi: Well, Kirill thank you so much for having me and
grade work with this podcast. I'm really happy I had
the chance to be with you today, here.
Kirill Eremenko: So there you have it. That was Javier Luraschi from
RStudio. Hope you enjoyed this session as much as I
did personally. My biggest takeaway by far was the
fact that we dove into Apache Spark so
deeply and got to know this space so well from
firsthand, from a person who actually works in
developing a package to work with Apache Spark. So
Javier is up to date with all of the changes in Apache
Spark and he knows exactly everything that's
happening in this space. So it was really great to hear
this information, these insights directly from him. And
I'm sure you can also feel the immense passion that
Javier has for the space and in fact it's even
contagious, so I'm sure if you have never heard of
Apache Spark before, now you can feel that this is one
of those really powerful tools that maybe one day you'll
add to your Data Science toolkit.
Kirill Eremenko: On that note, you can get the show notes for this
episode at www.superdatascience.com/211. There you
will also find all the things that we mentioned in the
episode. All the materials we mentioned in the episode,
a URL to connect with Javier and follow him and his
career on all social media. You'll also find a link to the
upcoming book, to the website where you can register
to get some contents of Javier's upcoming book, which
is going to be pretty awesome based on what we heard
today. And of course, if you know anybody interested
in the space of Big Data, in Apache Spark who wants
to learn more, and who would be as excited about this
episode as you hopefully and of course as I was on
today's show, then please forward them this link. Help
spread the word, help other people learn these topics.
Apache Spark is a really cool tool that is helping data
scientists work with Big Data. So let's help each other
out. Send this episode to anybody who you think
might benefit from it. Whether it's a friend, colleague,
family member or somebody that you just know that
this will help them out.
Kirill Eremenko: On that note, thanks so much for being here today
and sharing this hour with us. Can't wait to see you
back here next time, and until then, happy analyzing.