SDS PODCAST EPISODE 51 WITH RANDAL SCOTT KING

Post on 14-Jun-2020

TRANSCRIPT

Kirill: This is episode number 51 with Global Analytics Consultant

Scott King.

(background music plays)

Welcome to the SuperDataScience podcast. My name is Kirill

Eremenko, data science coach and lifestyle entrepreneur.

And each week we bring you inspiring people and ideas to

help you build your successful career in data science.

Thanks for being here today and now let’s make the complex

simple.

(background music plays)

Welcome to the SuperDataScience podcast everybody. Hope

your week is going great, and today we've got a very

interesting guest. Scott King is an analytics consultant, also

the founder of Brilliant Data, and also a renowned analytics

speaker. Scott comes into organisations to help them with

their data capabilities. He helps executives with their

strategy, helps them understand what are the best tools for

their organisation, and which is the best way to proceed

going forward into the future.

In this podcast, we predominantly talked about Hadoop.

So if you've always wanted to find out a bit more about

Hadoop, or really what's going on at the cutting edge of

technology in this space, then this is the podcast for you. In

today's episode, you will learn a lot. And I mean a lot. Get

your pens and papers out, because today, you will learn

about HDFS, MapReduce, Hive, HAWQ, Pig, Kudu, Impala,

Spark, Seahorse, and what a data lake is, and much, much

more.

And my favourite part is that Scott has a natural ability to

explain these things in a very, very simple manner so you

don't have to be highly technical, you don't have to already

have a lot of knowledge about Hadoop to understand all of

these things. It'll be very easy for you to get up to speed. So

this is your crash course into Hadoop. By the end of this

hour you will know how to use these terms, or at least

be up to date with what's going on in this space.

In addition, we will also talk about Scott's public speaking

and he'll give you some tips and tricks on how to prepare for

data science presentations, how to present to different types

of audiences, and the things that he does that help him out

when he's doing public speaking.

So all in all, very interesting, exciting podcast. Can't wait for

you to check it out. And without further ado, I bring to you

Scott King, Global Analytics Consultant and Founder of

Brilliant Data.

(background music plays)

Hello everybody and welcome to the SuperDataScience

podcast. Today we've got Randal Scott King calling in from

Atlanta. Randal Scott is a Global Analytics Consultant. How

are you going today, Scott?

Scott: I'm doing very well, how about yourself, Kirill?

Kirill: I'm well as well. And just for the benefit of our listeners, can

you explain the whole legacy of names in your family?

Because your name is Randal Scott, but everybody calls you

Scott. How does that come by?

Scott: I think that's just a thing in my family. We give people first

names and then don't use them. My father is John Richard.

He went by Richard, I think probably because his father was

also John Richard.

Kirill: Ok, so John Richard the junior, right?

Scott: Yeah, exactly, yeah.

Kirill: And you said you have 5 children, are you going to also give

them double names as well? That's 10 names you gotta

think of!

Scott: Well, yeah. So as a matter of fact, yeah. As a matter of fact,

my oldest daughter goes by both of her names.

Kirill: Ok. Alright. Gotcha. Ok, well thanks a lot for coming on the

podcast and so for those of you who haven't heard what

Scott does, Scott is a Global Analytics Consultant and he

consults companies like the Fortune 500 companies in

analytics, in spaces like Big Data, and Hadoop, just to name

a few. And so this is going to be a pretty exciting chat. So

Scott, tell us how you got into this whole space of analytics

consulting.

Scott: I started as a user. I was actually the Director of Business

Development at an IT reseller. And I was responsible for

making sure the company sold $100 million worth of Cisco

product.

Kirill: Wow.

Scott: And I realised one day, I was like, I found on the company's

internal website that somebody would publish the entire

dump from the Oracle ERP that had every transaction year

to date on it. And I realised pretty quickly just how powerful

that was. I spun up an instance of Pentaho. They have an

open source version, but it's a BI package. And I would go

out and get that database dump every day and put it on my

laptop and I started looking at the data and realized “Oh, my

gosh. I can find out everything about the business. I can find

out who’s selling what, where, and to whom.” It was really

useful in my job as business development to know all of that

about the business. We were able to post some really strong

gains year over year for the three years that I was there by

just knowing what was going on.

Kirill: Wow! That’s powerful.

Scott: So, BI is great. It tells you what’s going on with the company

right now and in the past, but it was really when I

discovered machine learning and I was like, “Oh, my God. I

can predict things now.”

Kirill: Yeah, totally. And then you slowly transitioned into the

space of Hadoop or you do both at the moment?

Scott: So once we started doing that, we started going back further

in time. We had something like 8 years’ worth of data.

Obviously, that outgrew my laptop really quickly. (Laughs)

So the light bulb just went off one day and I was like, “You

know what, if we’re getting this much value out of this kind

of stuff, we could be doing this for clients.” So yeah, I left

and went out and started Brilliant Data and started doing

that sort of thing for clients.

Kirill: Okay. That’s pretty cool. Are you consulting on your own or

do you have a team of people that you’re working with?

Scott: I’ve got a small team. We’re a small boutique firm but we’re

getting to be pretty well-known for getting results.

Kirill: Okay. That’s great. I’m actually looking at your LinkedIn

now and there’s so many people leaving comments about

how you presented within companies and on Hadoop and

really shaped other companies’ ways of thinking, so CEOs

and directors are leaving great comments. Can you tell us a

bit more about that? Apart from implementation, you’re

obviously also pushing the envelope in terms of an analytics

agenda and building this analytics culture in different

organizations. How do you go about doing that?

Scott: Well, you know, I’ve always done public speaking in some

form or fashion throughout my career. I’ve been in IT

consulting for about 20 years and have always either taught

classes or gave sales pitches as a sales engineer, any

number of things that have to do with speaking. And I

realized, when I started Brilliant, I was like, “You know, this

is going to be a valuable tool for bringing business in, is

getting out there and speaking and kind of starting the

conversations around things.”

These days I see it more as kind of educating people on the

state of analytics and the state of big data because it really

is changing so quickly all the time. In the past, you really

wouldn’t want to ever use Hadoop as a data warehouse

because there were some significant limitations there, but

there’s been some things that have come along just recently

that have kind of changed that now and it’s at least possible,

if not advisable, to do now. A lot of people don’t know that.

You know, you kind of get out there and let people know.

Kirill: Okay, gotcha. So basically, two areas that I want to dig into

further is, of course, your public speaking and your

experience there and also Hadoop and what you can tell us a

bit more about that. Maybe let’s start with Hadoop. In five

sentences or less, what is Hadoop for a person listening to

this podcast who hasn’t encountered Hadoop or just heard it

as a buzz term?

Scott: Well, there’s really two reasons why you would want to use

Hadoop. You’ve just got more data than a traditional

relational database can work with, or you’re wanting to work

with types of data that a relational database can’t work with.

Kirill: Interesting. And how much data are we talking that a

relational database can’t handle? A gigabyte, 10,000

gigabytes?

Scott: No, certainly not in gigs. You’d have to get into the upper

terabytes or even petabytes before you’d really max out most

of the really good relational databases like your Oracles and

whatnot.

Kirill: Gotcha. Does only the amount of gigabytes matter or also

the — you know, people probably have heard of the three Vs

of Hadoop – velocity, variety and veracity, I think, or

variability, maybe 4 Vs. Does it have to be just the volume of

data, or can it also be the different types of data that you

have in your dataset?

Scott: Well, yeah. It’s really those two things. And you’re right,

there are the three Vs, but the two that I think of when I’m

talking to clients and figuring out if they actually need

Hadoop – because a lot of people think they need Hadoop

and they don’t. I told somebody recently — they had SAP,

you know, they were wanting to build dashboards and I’m

like, “You do realize you can do that with what you have,

right?” But the two things that I think of when a client says,

“We need Hadoop,” is “Okay. Do you have the volume of data

necessary for Hadoop?” “Well, no.” “Okay. Do you have a lot

of different kinds of data that you want to work with that

you would find difficult to put into a database?” And they’re all

like, “Yes. You see, we’ve got these customer service

transcripts that we want to go through and do text analysis

on them,” etc. You know, you’re not going to do that in a

traditional relational database.

Kirill: Yeah, gotcha. Understood. All right, so Hadoop if you have a

lot of data, if you have lots of different types of data. So, how

does this happen? Somebody calls you up and says, “We

want Hadoop.” You ask them the questions and then you

say, “Either you need Hadoop or you don’t.” But if they do

need Hadoop, what happens from there?

Scott: Well, it’s a matter of determining what they have and what

they want to bring into Hadoop and what is the ultimate

business benefit that they want to achieve. I think that’s

really where a lot of Hadoop—there’s all kinds of press these

days about the demise of Hadoop. I think it was Mark Twain

who wrote to his local newspaper, they had published an

obituary for him, and he said “The news of my death has

been greatly exaggerated.” I think the news of Hadoop’s

death has been greatly exaggerated. I think that goes back to

the Gartner hype cycle. You know, we’re in the trough of

disillusionment right now and I’m really looking forward to

the — what is it, the plateau of productivity?

Kirill: Something like that, yeah. So, basically you go in and you

implement the system. How long does it take? What are the

main challenges that you face when you’re implementing

Hadoop at an organization?

Scott: Well, to actually stand up a Hadoop cluster doesn’t take very

long at all. To bring in the client’s data and to do the

groundwork necessary for that to figure out—to sit down

with them and say, “Okay, here’s the problem or problems

that you’re trying to solve” or “Here’s the additional

capabilities that you want to develop,” that takes a while.

Bringing the data in takes a while. You know, people think

of Hadoop as that whole data lake concept of “Oh, just throw

it all in the data lake and we’ll figure it out later.”

Well, you can do it that way, sure, but I wouldn’t advise it.

But sitting down with the client and figuring out what it is

that they’re trying to accomplish, figuring out what data

they’re going to need for that and how to structure that

inside either HDFS or Hive or now, if it’s a Cloudera install,

Kudu, which is a great little tool, we can talk about it in

more detail later if you want. But figuring out what exactly it

is that they’re wanting to accomplish and how to set that up

in Hadoop is what takes the longest time. Usually, once the

data is in and it’s in the format that you want, doing

analysis doesn’t really take all that long.

Kirill: Yeah, gotcha. And there are tools like Pivotal, for instance.

It’s a pretty good tool to use on top of Hadoop to make things

easy. They have PivotalR even, to allow for that. Okay, so

you’ve mentioned a couple of words – HDFS, Hive, Kudu.

Could you tell us a bit more? Let’s start getting into the

technical side of things.

Scott: Sure. The central idea behind Hadoop is that it’s a

distributed system, right? So you take a big problem like

analysing terabytes or petabytes of data and you cut it up

into smaller problems and you spread that load out over the

individual servers that make up the cluster and then they do

their part of it and send it back to you and you reassemble

the whole thing on the other side. Hadoop is all about

distributed computing. And HDFS, which stands for Hadoop

Distributed File System, is exactly what it sounds like. It’s a

distributed file system that spans the whole Hadoop cluster

and sits on top of the file system on each node.
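(Illustrative sketch: the "cut it up, spread it out, reassemble" idea Scott describes, shown with a thread pool on one machine standing in for the servers of a cluster. The function names and the choice of summing numbers are assumptions made for the example; this is not Hadoop code.)

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, num_chunks):
    """Cut one big list into roughly equal smaller lists."""
    chunk_size = max(1, -(-len(values) // num_chunks))  # ceiling division
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def distributed_sum(values, num_workers=4):
    """Hand each chunk to a worker, then reassemble the partial answers."""
    chunks = split_into_chunks(values, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # each worker does its part
    return sum(partial_sums)  # put the pieces back together


total = distributed_sum(list(range(1, 1001)))
print(total)
```

Each worker only ever sees its own chunk, which is the property that lets a real cluster grow by adding machines.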

Kirill: Okay. So that’s basically the system that connects

everything together?

Scott: That’s one way of thinking of it, yeah. I mean, it’s a file

system just like any other, but it spans the whole cluster

and as a result it can be massive.

Kirill: Okay. And it’s one of the main advantages of Hadoop, is that

it’s scalable, right?

Scott: Yes. You want to make it bigger, you just add machines to

the cluster.

Kirill: Okay. And how does that compare to using a super

computer? Why would one use Hadoop over using a super

computer?

Scott: Super computers tend to be very, very costly whereas

Hadoop uses commodity hardware. I tell people Hadoop is

the exact opposite of virtualization. With virtualization you’re

taking one big physical server and you’re running a bunch of

virtual servers on it. Hadoop does the exact opposite. It

takes a bunch of physical servers and makes one big virtual

server out of it.

Kirill: Okay, gotcha. So what happens when one of those individual

servers or commodity hardware breaks down?

Scott: Nothing. That’s the beauty of it. So HDFS replicates data a

minimum of three times. There will be three nodes that your

data is on. So if one of those nodes takes a dive, let’s say the

power supply sparks and you have to pull it out and replace

the power supply or just replace the entire machine. Well,

when you put it back in, HDFS will actually rebuild the file

system that was on it.
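(Illustrative sketch: a toy model of why losing one node loses no data when every block lives on three nodes. Random placement, the node names, and the choice to recreate missing copies on surviving nodes are assumptions for the example; real HDFS placement and recovery rules are more involved.)

```python
import random

REPLICATION = 3  # the replica count Scott mentions


def place_blocks(num_blocks, nodes, rng):
    """Put each block on REPLICATION distinct nodes (random placement is an assumption)."""
    return {block: set(rng.sample(nodes, REPLICATION)) for block in range(num_blocks)}


def fail_node(placement, dead_node, live_nodes, rng):
    """Drop a dead node, then copy its blocks elsewhere until each has 3 replicas again."""
    for holders in placement.values():
        holders.discard(dead_node)
        while len(holders) < REPLICATION:
            candidates = [node for node in live_nodes if node not in holders]
            holders.add(rng.choice(candidates))
    return placement


rng = random.Random(0)
nodes = [f"node{i}" for i in range(1, 21)]  # a hypothetical 20-node cluster
placement = place_blocks(100, nodes, rng)

live = [node for node in nodes if node != "node3"]
placement = fail_node(placement, "node3", live, rng)
```

Because two other copies of every block survive the failure, the missing copy can always be rebuilt from one of them.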

Kirill: Okay. That’s pretty cool. So basically you have three copies

of your data on separate nodes, but are they all being

processed simultaneously, or data is still processed on each

individual node but that is just kept as a backup?

Scott: Well, what HDFS does is let’s say that you have a terabyte

worth of data, it’s some kind of big table or something. What

HDFS is going to do is it’s going to take that and divide it up

into blocks. The size of the block is configurable, but I think

the default these days is 128 MB. Sometimes you might

want that to be 64 MB, sometimes you might want it to be

256. Like I said, it’s a configurable parameter. HDFS is going

to take that big terabyte file and it’s going to chop it up into

128 MB blocks and it’s going to replicate each block three

times.

So if you have a 20-node cluster, for example, that first

block might be on servers 1, 3 and 5. The second block

might be on servers 2, 15 and 10, etc. So the whole data gets

spread out over the cluster and then the name node, which

is the part of the cluster that decides where things go and

decides who does what, it will pick one of the nodes that that

first block of data is on and say, “Okay, your job is to

calculate this block of data for this job.” Does that make

sense?
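(Illustrative sketch: the arithmetic behind Scott's example of a one-terabyte file, 128 MB blocks and three copies of each block. Binary units, 1 TB = 1024 x 1024 MB, are an assumption, and the function name is made up for the example.)

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB


def block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, total stored block copies, raw bytes consumed)."""
    num_blocks = -(-file_size_bytes // block_size_bytes)  # ceiling division
    return num_blocks, num_blocks * replication, file_size_bytes * replication


blocks, copies, raw_bytes = block_layout(1 * TB)
print(blocks, copies, raw_bytes // TB)  # 8192 blocks, 24576 copies, 3 TB of raw disk
```

Halving the block size to 64 MB doubles the block count to 16,384, and doubling it to 256 MB halves the count to 4,096, which is the trade-off behind making the parameter configurable.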

Kirill: Yeah, gotcha. And what about MapReduce? Can you tell us

a bit more about MapReduce?

Scott: Well, I kind of just did. (Laughs)

Kirill: Okay. So that is MapReduce?

Scott: MapReduce is the computation engine, or the original

computation engine, for Hadoop. MapReduce is the part that

does the actual calculating of anything on the data in Hadoop.

Since Hadoop v2 and with the introduction of something

called YARN, Yet Another Resource Negotiator, there are

additional computation engines like Tez and like Spark that

can run on a Hadoop cluster now, you’re not just limited to

MapReduce. And that’s a good thing because MapReduce

has some significant limitations.
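(Illustrative sketch: the map, shuffle and reduce steps in plain Python, using the usual word-count example. Real MapReduce jobs are typically written in Java and run across many machines; the function names and sample lines here are assumptions for the example.)

```python
from collections import defaultdict


def map_phase(blocks, mapper):
    """Run the mapper over each block independently, as separate nodes would."""
    for block in blocks:
        for record in block:
            yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, reducer):
    """Collapse each key's list of values into one result."""
    return {key: reducer(key, values) for key, values in grouped.items()}


def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    return sum(counts)


# Two "blocks" of a file, as if they were stored on different nodes.
blocks = [["the quick brown fox", "the lazy dog"], ["the fox"]]
result = reduce_phase(shuffle(map_phase(blocks, word_mapper)), count_reducer)
print(result)
```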

Kirill: Okay, gotcha. All right. Moving on to this whole animal

kingdom of Hive, Kudu, Pig, etc.—

Scott: Oh, my gosh. It grows every day, man.

Kirill: (Laughs) What can you tell us about that? First of all, why

are the names all animals?

Scott: That’s a really good question. I’m not sure, but I do know

that Hadoop itself, that name came from Doug Cutting’s

young son at the time, and this was 10 years ago so that kid

could probably be an adult by now. But at the time, he had

this little toy stuffed elephant, this yellow stuffed elephant

that was named Hadoop and Doug decided to name that

project after that little toy.

Kirill: It’s like kids – inventors or developers on these things and

their kids. Same thing with Pivotal. I think they have a

product called Plum, or it’s a name of the company—

Scott: Greenplum.

Kirill: Greenplum! So, to his child at the time, he was like creating

and said, “Oh, I have a cool idea. What should I call it?” And

the child picks up an apple from the basket of food that they

have at the table and the guy is like, “Well, I can’t call it

Apple. Apple is already taken. What’s the closest thing to a

green apple? A green plum.” So that’s why they call it

Greenplum. It’s just funny. Anyway, Hive – what is Hive?

Tell us about that.

Scott: Let’s say you bring in a bunch of CSV files into HDFS and

you want to be able to run SQL queries on those files. Well,

that’s what Hive does. It creates tables on top of these files

to make them searchable by SQL queries.

Kirill: Okay, gotcha. So, it kind of makes unstructured data

workable through this traditional structured approach.

Scott: It’s kind of an interpreter, really. It’s an abstraction that sits

on top of MapReduce and it translates SQL code into

MapReduce Java and sends that to MapReduce and says

“Hey, execute!”
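(Illustrative sketch: roughly what "translating SQL into MapReduce" means for a query such as SELECT region, SUM(amount) FROM sales GROUP BY region. The CSV contents, column names and function names are hypothetical, and this is hand-written Python rather than anything Hive generates.)

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV content standing in for a file stored in HDFS.
SALES_CSV = """region,amount
east,100
west,250
east,50
"""


def mapper(row):
    """Emit (GROUP BY key, value to aggregate) for one CSV row."""
    return row["region"], int(row["amount"])


def reducer(values):
    """SUM(amount) for one group."""
    return sum(values)


def run_group_by_sum(csv_text):
    grouped = defaultdict(list)
    # The column names are applied when the file is read, not when it is stored.
    for row in csv.DictReader(io.StringIO(csv_text)):
        key, value = mapper(row)
        grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


totals = run_group_by_sum(SALES_CSV)
print(totals)
```

The person writing the query only sees the SQL; the mapper and reducer are the part an interpreter layer would produce on their behalf.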

Kirill: Yeah. Pivotal have a similar thing to that. You can do SQL

on Hadoop through a Pivotal product as well, I just don’t

remember what it’s called.

Scott: I want to say that’s HAWQ.

Kirill: Exactly, that’s right. Another bird, another animal. (Laughs)

Scott: Another animal. (Laughs)

Kirill: I’m not surprised. All right, cool. That’s good. So basically

even somebody not knowing how to work with unstructured

data and do Java code and all those other things, they can

still work with Hadoop through things like HAWQ or Hive.

Scott: Yes. And there’s actually more abstractions on top of

MapReduce. Like Pig you mentioned earlier, it’s a scripting

language and it’s much easier to learn than Java. Again,

basically all it does it takes that scripting language, turns it

into Java and sends it to MapReduce and says “Execute.”

Kirill: Okay, good. That’s good. So we’ve covered HDFS,

MapReduce – this is value – Hive, HAWQ, PivotalR, Pig.

Okay, next one on the line – Kudu. You said that’s

Cloudera’s creation. Tell us a bit more about Kudu.

Scott: Yeah, a lot of people have tried to implement Hadoop in a

way that you would normally only want to do with a

relational database. I mean, Hadoop and relational

databases were never meant to do the same things. But a lot

of people have tried, for example, to deploy Hadoop as a data

warehouse. This is something that I’ve advised people

against in the past because the things that you have to do to

make that work are just [indecipherable 23:11], for a lack of

a better term. The problem is that data in HDFS is

immutable. You don’t change data in HDFS. You either

delete it and overwrite it or you append.

Obviously, that creates some problems as far as trying to

use it as a data warehouse because data warehouses update

information all the time. So what Kudu really does – and I

don’t understand why Cloudera is not marketing it this way,

but they’re not – but what Kudu does is it basically

eliminates the immutability issue. Data in Kudu is

absolutely updatable. You can do updates, you can do

inserts, you can do deletes just like you would on a

relational database.

Kirill: Okay. That makes sense. So how does Kudu do it in terms of

— if HDFS restricts the updating of data, how does Kudu go

around that?

Scott: Well basically it does it by not using HDFS. So, Kudu is

actually a storage engine, you know, they’re very hesitant to

call it a database because really it’s not, but it is a storage

engine that is updatable whereas HDFS isn’t. So you can

have HDFS and Kudu running side by side on a Hadoop

cluster and use them for different things, but Kudu does not

rely on HDFS at all.

Kirill: Okay.

Scott: Hive relies on HDFS. Kudu just has its own thing for that.
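(Illustrative sketch: the difference between an append-or-replace store and an updatable one, using two toy classes. Both classes and the customer rows are hypothetical; neither uses the real HDFS or Kudu APIs.)

```python
class AppendOnlyFile:
    """Mimics the rule Scott describes for HDFS: append or replace wholesale, never edit in place."""

    def __init__(self):
        self._rows = []

    def append(self, row):
        self._rows.append(dict(row))

    def overwrite(self, rows):
        self._rows = [dict(row) for row in rows]

    def rows(self):
        return [dict(row) for row in self._rows]


class UpdatableTable:
    """Mimics a keyed store where single rows can be inserted, updated or deleted."""

    def __init__(self):
        self._by_key = {}

    def insert(self, key, row):
        if key in self._by_key:
            raise KeyError(f"row {key!r} already exists")
        self._by_key[key] = dict(row)

    def update(self, key, **changes):
        self._by_key[key].update(changes)

    def delete(self, key):
        del self._by_key[key]

    def get(self, key):
        return dict(self._by_key[key])


customers = [{"id": 1, "city": "Atlanta"}, {"id": 2, "city": "Boston"}]

# Append-only style: changing one city means rewriting every row.
hdfs_like = AppendOnlyFile()
for customer in customers:
    hdfs_like.append(customer)
rewritten = [dict(row, city="Denver") if row["id"] == 2 else row for row in hdfs_like.rows()]
hdfs_like.overwrite(rewritten)

# Updatable style: touch only the rows that changed.
kudu_like = UpdatableTable()
for customer in customers:
    kudu_like.insert(customer["id"], customer)
kudu_like.update(2, city="Denver")
kudu_like.delete(1)
```

With two rows the rewrite is trivial; with a terabyte table it is the reason a data warehouse workload is painful on an append-only store.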

Kirill: Okay. How about next in our line-up of animals and

creatures – Spark and Seahorse? What can you say about

Spark and Seahorse?

Scott: Well, going back to what I said earlier about MapReduce,

you know, it was the original computation engine. But in

Hadoop v2 they introduced YARN. I like to call YARN the

mom of the cluster. You know, when you were a kid and you

wanted something or you wanted to do something you went

to mom, right? Because if you went to dad, he would just

say, “Well, what does your mother say?”

Kirill: (Laughs) Totally, yeah.

Scott: But, yeah, I mean, if you and your brother or you and your

sister wanted a cookie and there was only one cookie, mom

was going to be the one who decided who got the cookie,

right? So YARN does the same thing in a Hadoop cluster.

YARN is administering the resources of the cluster and

deciding, if you’ve got a bunch of different jobs running on a

Hadoop cluster, YARN is going to say, “Okay, I’m giving this

job this much processing power, etc.” That used to be in

MapReduce. So, in v1, MapReduce did all that. But they

took that out of MapReduce and put it into YARN so now

MapReduce just does computation. And what that did is it

opened up the window to other computation engines and

Spark is one of them.
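(Illustrative sketch: the "mom of the cluster" idea as a toy resource manager that grants or refuses memory to jobs, whichever engine they run on. The class, job names and memory figures are hypothetical; real YARN scheduling involves queues, containers and CPU as well.)

```python
class ToyResourceManager:
    """Tracks free memory and grants or refuses each job's request."""

    def __init__(self, total_memory_gb):
        self.free_gb = total_memory_gb
        self.grants = {}

    def request(self, job_name, engine, memory_gb):
        if memory_gb > self.free_gb:
            return False  # not enough left, so the job has to wait
        self.free_gb -= memory_gb
        self.grants[job_name] = (engine, memory_gb)
        return True

    def release(self, job_name):
        _engine, memory_gb = self.grants.pop(job_name)
        self.free_gb += memory_gb


manager = ToyResourceManager(total_memory_gb=64)
etl_ok = manager.request("nightly_etl", "mapreduce", 40)      # granted
training_ok = manager.request("model_training", "spark", 20)  # granted, runs side by side
report_ok = manager.request("adhoc_report", "spark", 16)      # refused, only 4 GB free

manager.release("nightly_etl")
retry_ok = manager.request("adhoc_report", "spark", 16)       # granted after the ETL job ends
```

The manager never looks at what the job computes, only at what it asks for, which is why a MapReduce job and a Spark job can share one cluster.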

Kirill: So Spark goes instead of MapReduce?

Scott: Yeah. And the great thing about it is, because of the way

YARN works, you don’t have to choose either/or. So, for

example, if you’ve got a job that could really benefit from

MapReduce’s batch-oriented nature, which a lot of ETL

processes would qualify, then you can have that job running

in MapReduce while over here you’ve got a different job

running in Spark and taking advantage of Spark’s greater

speed.

Kirill: Gotcha. So advantage of Spark over MapReduce is speed?

Scott: The advantage of Spark is really it’s orders of magnitude

faster than MapReduce. The reason for that is MapReduce is

very disk-oriented and very batch-oriented. It’s just slow.

When MapReduce was invented, they weren’t really all that

concerned with speed. They were just concerned with being

able to do the kinds of things that MapReduce does at all.

So, whereas MapReduce is batch-oriented and disk-oriented

– I mean, it does everything on disk – Spark will actually

load data into memory. And again, it’s just orders of

magnitude faster because of that.
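(Illustrative sketch: the same three-step job run two ways, once writing every intermediate result to disk and reading it back, once keeping it in memory. It only counts disk round trips and makes no timing claim; the functions are hypothetical and use neither MapReduce nor Spark.)

```python
import json
import os
import tempfile


def run_steps_on_disk(values, steps, workdir):
    """After every step, write the result to a file and read it back."""
    path = os.path.join(workdir, "intermediate.json")
    disk_round_trips = 0
    for step in steps:
        values = [step(value) for value in values]
        with open(path, "w") as handle:
            json.dump(values, handle)
        with open(path) as handle:
            values = json.load(handle)
        disk_round_trips += 1
    return values, disk_round_trips


def run_steps_in_memory(values, steps):
    """Keep intermediate results in memory between steps."""
    for step in steps:
        values = [step(value) for value in values]
    return values, 0


steps = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
data = list(range(5))

with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_trips = run_steps_on_disk(data, steps, workdir)
memory_result, memory_trips = run_steps_in_memory(data, steps)
```

Both runs produce the same answer; the disk version simply pays for a write and a read after every step, and that cost grows with the number of steps in the job.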

Kirill: Gotcha. All right. And what can you say about Seahorse?

Scott: You know, Seahorse is this brand-new thing that I just

stumbled across maybe a month and a half ago. What

Seahorse does is it actually gives you a graphical front end

to Spark. So Spark has these great machine learning

libraries – ML and MLlib – and Seahorse gives you a

graphical interface to using those. I’m noticing it really

speeds up prototyping for me. I loaded it on my laptop and

interfaced with the cluster here at the office and I’ve noticed

that rather than hand coding everything in Python or Java

or whatever, having that graphical front end enables me to

prototype things so much quicker.

Kirill: Really, they are just punching these out, like one and a half

months ago, and you’re obviously in the centre of everything

that’s going on with Hadoop, how quickly do they release

these things?

Scott: You were saying something about the ecosystem around

Hadoop earlier and how they’re all named after different

animals and all this and I was like, “Yeah, there’s a new one

every day.” You know, again, you could spend all kinds of

time just keeping up with the latest on either Hadoop or big

data or machine learning or data science. I mean, there’s

always something new every week.

Kirill: And how do you do it? How do you keep up with everything?

Scott: Who says I do? (Laughs)

Kirill: Well, you seem to be pretty up-to-date with everything we’ve

discussed so far. You definitely know what’s going on.

Scott: I think keeping current on anything in tech is really a matter

of community. I introduced a friend of mine who’s a data

scientist at a benefits company. I sent him an e-mail and

said, “Hey, have you seen this thing?” I was talking about

Seahorse. “Seahorse is basically a GUI for Spark.” He’s like,

“Whoa, I love this!” And there’s always somebody who says,

“Hey, Scott, have you seen this thing over here?” Sometimes

it’s, “Yeah, I have. I think it’s great.” And sometimes it’s,

“Oh, my God, no! What is that?”

Kirill: It’s a seahorse. (Laughs) Okay. That’s pretty cool.

Summarizing all of these things up together, can you give us

a quick description, what do people mean when they say

data lake? Obviously these tools somehow altogether assist

or facilitate data lakes or are used in data lakes. What do

people mean by the term data lake?

Scott: I think different people mean different things by it. I think a

lot of people use the term and don’t even really know what

they mean by it.

Kirill: All right. Let’s make it clear for everybody so that when

people use it in the future, they know what they’re talking

about.

Scott: Yeah. So, when you think about a lake — a lake has layers,

right? You’ve got the sediment and you’ve got kind of the

murky depths and then you’ve got the top. And really, what

it’s all about in terms of making that useful for an

organization is you’ve got different kinds of data that you

can put in there. You know, earlier we said that thing about

some people talk about how Hadoop is where you can just

throw things and deal with it later. You really can. You can

use Hadoop to archive data and just get it in there and

figure out what to do with it a couple of years down the road

when there’s some use for it. Or you can have it in a Hive

table or a Kudu table for ready access.

I think really the whole data lake concept came about from

having a place where you could put everything, which was

kind of the goal of the data warehouse, but data warehouses

have very strict schema and whatever data you’re going to

put in there has to conform to that schema. With Hadoop

you don’t really have that restriction. You can put anything

in there. And I think that’s really the whole point of the data

lake concept. Yeah, there’s all kinds of stuff that you can put

in there and you can either start using it immediately or

figure out what to do with it later.

Kirill: All right. But how do I imagine it because it’s so different to

what people are used to, whether it’s SQL or Excel or just

folders and files. Can I imagine a data lake as just a huge

folder where I can just create new folders and just dump all

of my videos or all of my texts, scanned documents,

whatever I want? Does that analogy make sense? Do you

think of it as folders containing certain information? Or does

it look like something else to you?

Scott: No, that’s a very apt description of it. I mean, it’s a file

system much like any other, HDFS is. And you can put

almost anything onto it. Like, you mentioned video. You can

put images, you can put audio, you can put texts. Really,

whatever it is that you want to stick on there, you can stick

on there.
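
The "folders you can dump things into" picture maps fairly directly onto the `hdfs dfs` command line. The sketch below only builds the command lines rather than running them; the function name and any paths passed to it are hypothetical, and it assumes a machine with the Hadoop client installed if the commands are actually executed.

```python
def hdfs_upload_commands(local_files, lake_dir):
    """Build `hdfs dfs` command lines that create a folder and copy files into it."""
    # `-mkdir -p` creates the target folder (and any missing parents) on HDFS.
    commands = [["hdfs", "dfs", "-mkdir", "-p", lake_dir]]
    for local_path in local_files:
        # `-put` copies a local file of any type (video, image, text) into HDFS.
        commands.append(["hdfs", "dfs", "-put", local_path, lake_dir])
    return commands
```

Each returned list could be handed to `subprocess.run` on a cluster edge node; the file type makes no difference to HDFS.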

Kirill: Okay, gotcha. Thanks for the confirmation. It makes things

a bit more clear now. And with your public speaking, so you

have all this vast knowledge on in-depth topics and you

know obviously how to set these things up and communicate

them well. When you go into a company and you do the

speaking there to the employees and even executives, what

is your main goal? What is your main goal that you’re trying

to communicate to them and what do you want as an

outcome for them at the end of your conversation?

Scott: Well, I mean, it’s really different. If I’m doing public speaking

as a public speaking thing, if they’re bringing me in to do a

30 minute, 60 minute, 90 minute keynote, then it’s a matter

of I’m trying to find out what it is that they want to

accomplish. For example, there was this benefits company

here in the States. They have like 300 Java developers and

they had done a proof of concept with Hadoop and they were

going to move forward with using Hadoop and they wanted

me to come in and address these 300 Java developers at

their annual conference that they have internally within the

company.

And I said, “Okay, what is it really you’re trying to

accomplish with this keynote?” And they said, “Well, we

want to educate them on Hadoop, give them the 30,000 foot

view of how it works and what the pieces, parts are, but

mostly we want them to not be afraid of Hadoop and

understand that their jobs are changing a bit, but they won’t

be changing that much and they won’t go away.” And so

that’s exactly how I tailored the speech, was “Hey, guys.

Here’s what Hadoop is. Here’s why the company is going to

Hadoop. Here’s why your future as Java developers is very

secure with Hadoop: a) MapReduce is Java.” If you’re

programming in MapReduce, you’re not using Python or

Scala or whatever. That’s what they were trying to

accomplish. They wanted to kind of reassure these guys and

educate them at the same time. That’s what I tailored the

speech to do.

Kirill: Gotcha. Okay, it makes sense. Can you give us another

example of a different speech where you had to tailor to a

different type of audience?

Scott: Actually, yes. An IT organization asked me to come in and

address the executives in their own company and explain to

them why the IT organization wanted to use Hadoop to

augment the company’s existing data warehouse. You know,

I was talking to the senior executive who wanted me to do

this and I’m like, “Um, couldn’t you do this?” (Laughs)

Kirill: What was his answer to that?

Scott: He said, “Well, yes, but a) I’m not good at that sort of thing,

and b) I think it will carry more weight if it’s coming from a

third party.” (Laughs)

Kirill: Okay.

Scott: So, I was hired by the IT organization to convince executives

to go in a direction that the IT group wanted to go.

Kirill: Okay, definitely very different. I see how that’s working. So

what’s your biggest challenge that you face when speaking

to whether it’s executives or large audiences like that? Is it

hard to get the message across about Hadoop?

Scott: Not too hard, no. I’ve always had kind of a knack for taking

something that’s really complicated and breaking it down to

its essence and explaining that in a way that most people

can understand. To me, the most important thing is to make

sure that I understand what it is they’re trying to accomplish

with the speech – that’s really the biggest thing – and to

work with them and make sure that “Okay, what does

success look like at the end of this speech? What’s going to

happen? Who’s going to do what? What information is going

to get conveyed? Is there a skill that you want them to

develop? Is there an idea that you want to plant in their

head? What does success look like? How is life different after

this speech?”

Kirill: Okay, gotcha. Just on that, I wanted to ask you as well, for

people out there who are listening and who want to get into

speaking, or public speaking, on data-related topics,

whether it’s data science, machine learning or even Hadoop

like yourself, what would your biggest one piece of advice be

for them? Because obviously data science is a very kind of

topic where you don’t interact with people that much, where

you’re very technical and going into speaking can be a big

shift for somebody. What would your best advice for a

person like that be?

Scott: I would probably say speaking is a skill just like any other

skill, like riding a bike or like coding in Python. You learn it

by doing it. If you haven’t done much speaking before,

probably the best thing to do is find opportunities to speak. I

hate to use the old cliché about “Go join Toastmasters,” but

people recommend that because it works.

Kirill: Okay, gotcha. And yourself, how did you get started into

speaking?

Scott: Oh, gosh. We had a piano in our house when I was growing

up. I grew up on a peanut farm in southern Alabama, so

there wasn’t a whole lot to do. I mean, there was plenty to

do, but there wasn’t a whole lot to do that was very

entertaining. At 6 years old, I started planting myself in front

of the piano and tinkering around on that and eventually

taught myself how to play piano. I want to say I was 12

when I started playing the organ at the little church around

the corner that had 70 people in it. And so from a very

young age I was used to being in front of people and I don’t

think that I ever had the opportunity to develop stage fright.

Kirill: Okay, so it’s pretty lucky circumstances that you put

yourself into.

Scott: Right. And speaking in front of people isn’t much different

from playing a musical instrument in front of them, really.

Kirill: Okay, gotcha. When you’re speaking to lots of people – I’m

just curious about this for myself – who do you focus your

attention on? Do you look at one person, or do you move

your eyes around the room to make everybody feel included?

Scott: Oh, you’ve got to move around the room, sure. I mean, if you

look at one person the whole time you’re speaking, you’re

going to accomplish one of two things. You’re going to make

everybody else feel kind of disconnected from you or you’re

going to really freak that person out.

Kirill: (Laughs) Yeah, they’d be like, “Why is he staring at me all

the time? I’m tired of his gaze.”

Scott: Yeah. So I kind of let my eyes wander the room as I’m

talking and kind of gauge different people’s reactions to what

I’m saying. Of course, the thing that you never want to see is

that person with half-shut eyes looking like they’re about to

nod off. (Laughs)

Kirill: Yeah, gotcha. That’s kind of like the next thing I wanted to

ask you. How do you structure your presentations in such a

way that people don’t fall asleep? Because a very common

flaw of technical presentations is that they’re either

overpopulated with formulas or overpopulated with text or

just the way the presentation is flowing just makes people

nod off and they can’t pay attention for more than 8 or 12

minutes. What’s your trick to that?

Scott: Well you have to know the audience, right? There’s nothing

wrong with a speech that’s full of technical information if

you’re speaking to 300 Java developers. There is everything

wrong with a speech that’s full of technical information if

you’re talking to a room full of business executives.

Kirill: Okay, gotcha. You want to keep it nice and sweet and short

for the business executives.

Scott: Well, keep it high level for sure. You know, don’t get into the

weeds with those kinds of guys. And be sure to make sure

that whatever it is you’re talking about this technical subject

relates back to some sort of business problem or business

result or something that they can relate to.

Kirill: Okay, gotcha. These speaking events, how do they link up

with the work you do at Brilliant Data? Do they detract from

your time that you could be spending implementing

Hadoop? Or do they facilitate that and actually help you

bring on more work into the company or better deliver the

projects that you're delivering by then following up with

some speaking events?

Scott: Well, I’m a consultant who speaks, not a speaker who

consults. Whenever I go and give a speech somewhere, it’s

always coming out of an experience that I had with a client

somewhere. And sometimes those experiences are kind of

humorous. That’s always great really because humour is a

necessary ingredient for public speaking. We were talking

earlier about keeping people’s attention and humour is great

for that. It also makes you seem a lot more human when

you’re on stage if you relate a funny story.

Kirill: Yeah, and approachable. Can you give an example of a

funny story you’ve used recently?

Scott: So, I started my career really as a network engineer, not

even really doing data. I was working for Cisco at the time

and they sent me out to this switch site for a major service

provider. They were having all kinds of errors. And when I

got there, the facility manager seemed very amused that I

was there. And so I said, “Yeah, I’m here to troubleshoot the

problems with this equipment.” He said, “Oh, yeah. I know

where that is. We’ll go up to the top floor.” So we went up

there and he showed me the routers that were having the

trouble and I said, “Jimmy, what is this bright film on top of

the equipment?” He goes, “Oh, that’s where the roof leaks.” I

said, “Leaks?! As in this is still going on?” He says, “Oh,

yeah. I’ve been sending e-mails trying to get permission to

get the roof fixed for like 6 months now.”

Kirill: (Laughs) Well, there is your problem, right?

Scott: Well, yeah. And so, you know, they spent whatever they

spent. I’m sure it was not cheap to have Cisco send me out

there to troubleshoot this when they really could have

figured it out just by calling the facilities.

Kirill: Yeah, okay. Definitely. I see how that can make people a bit

more happy in your presentations and lighten the mood up.

That’s a pretty cool story. Moving on to some other questions

that I have for you, what is a recent win that you’ve had in

your career that you can share with us, something that

you’re most proud of?

Scott: You know, I would have to say it’s been about a year and a

half ago now. Packt Publishing reached out and said, “Hey,

you know, we really want to do a training course on Hadoop

and we think you’re the right guy to do that.” I said, “Okay,

yeah. That sounds like a lot of fun.” They give you this

compressed six week time period to get this thing done. And,

you know, I know what I want to say, I know how to do this

thing that I’m about to demonstrate, but getting it recorded

and edited and put together into something that looks good

and sounds good is not easy and it’s very time-consuming,

whereas I thought, “Oh, yeah, six weeks. This will be done in

three.” No, it wasn’t. I think it was done in six weeks and

three days. (Laughs) But I think the end result was

something that I’m pretty proud of.

Kirill: That’s nice. So you have a course which people can take. So

our listeners, if they’re interested they can find your course.

Where can they find your course?

Scott: Well, they can go to packtpub.com or it’s also on the O’Reilly

site, oreilly.com. It’s called “Learning Hadoop 2”. There is a

book by that title and there is a video course by that title

and I did the video course.

Kirill: Okay, interesting. And how did you find creating videos?

Was it very different to public speaking and was it very

different to other forms of education that you’ve done before?

Scott: Well, it’s actually quite different from public speaking

because if you make a mistake you get to go back and fix it.

Kirill: (Laughs) So it’s better?

Scott: Oh, yeah. You know, you kind of think about what it is you

want to demonstrate and what you want to say about it

ahead of time. You know, if you’re going along and you kind

of trip over your words or you say the wrong thing or you

stub your toe on the leg of the table and you accidentally

curse, like I did one time, you just go back and you take that

out. Nobody has to know.

Kirill: Totally. Okay, next question is what is your one biggest

challenge that you’ve ever faced in your career?

Scott: Well, I’ll tell you. Leaving where I was, which was a very

comfortable position, I had a great boss, I had a great team

that I was working with. But leaving that and going out and

starting your own thing is very difficult. They say the three

hardest addictions to overcome are heroin, carbohydrates

and a monthly salary. Now, I don’t have any experience with

the first one, but I have lost 20 pounds before and I have left

a monthly salary and a very comfortable job to start a

company before. And those two are very hard things to go

without.

Kirill: Okay, gotcha. And was it tough the first few months or even

years when you left?

Scott: Oh, gosh, yeah. The first year was—let’s just say it was very

educational. (Laughs)

Kirill: Yeah, totally. I find it’s like nothing else. You learn so much

while you’re building a company, building a business,

building a team. How’d you go about building a team, by the

way? How long did it take you to hire your first employee?

Scott: Well, it took a while before I needed an employee. As far as

like starting the company, there’s this guy, Alan Weiss,

that’s done all these books on consulting and running a

consulting practice and all that. Something that he says is,

“When you’re 80% ready, go ahead and pull the trigger and

figure out the other 20% as you go.” I thought I was 80%

ready and then I realized after a year of looking back on that

year, I thought, “Oh, you know what? I was really only about

50%.”

Kirill: (Laughs) Yeah. But you pulled it off. That’s good.

Scott: Yeah, it took a while. My first employee was actually a

virtual assistant, really is what she was. She’s here in town,

but we met in a Starbucks around the corner from my house

here. You know, she started up a conversation and we’re

talking, and she asked what I was doing and I said, “You

know, I’m looking through these websites and I’m going to

hire a virtual assistant.” And she said, “Actually, I’ve been

out of work for about two years now. I had a baby and

stayed home and all that. I’m looking to get back into

working again and I’d be very interested.” So that’s kind of

how that whole thing came about. I told her, “Okay. Here’s

what the job would involve and here’s what the hours would

be and all that,” and it just kind of went from there.

Kirill: And how big is your team now?

Scott: I’ve got two salespeople, there’s myself, there’s a team of

subcontractors that I’ll use for various things, for example,

for user interface, design and anything graphical. There’s a

guy that I’ve known for 10 years and he’s probably one of the

top 10 in the country at that sort of thing.

So when I’ve got a project that’s big enough size that I need

to bring people in, I know who to call for that. And we’re

about to bring on a couple of interns, actually. I’m extremely

fortunate to be headquartered in Atlanta. Georgia Tech is

here and they have an extraordinary program for data

science and for machine learning. Some of the things they

do there are just incredible. We’re bringing on a couple of

interns who know data science and want to learn Hadoop.

Kirill: Okay, good. It’s like a win-win.

Scott: Absolutely. You know, if you’re going to do an internship, it’s

got to be win-win. There’s got to be something in it for them.

Kirill: Okay, cool. Next question is what is your one most favourite

thing about being in the space of data?

Scott: I’d have to say it’s just the diversity of things that you can

do, right? I mean, every company has data. It really doesn’t

matter what their business is, it generates data. I’ve been

fortunate enough to work with Fortune 500 companies and

mid-market companies and right now one of my clients is a

manufacturing company that makes plastic and metal

containers. Every engagement you do as a consultant in this

industry, you learn something, you know. Sometimes I learn

as much as the client I’m working with. Of course, I don’t tell

them that because then they’ll try to—

Kirill: (Laughs) That’s the best. Oh, my God, yeah. I totally

understand. I can completely relate to that because the

space is growing so quickly and the amount of new

technology and methodology that’s coming out all the time is

immense. It’s just impossible to know everything. Inevitably,

you’re going to be learning. Like, when I create a course, I

know some things. But some things I learn as I’m doing the

research for the content I’m creating. I’m sure you were the

same when you were creating your course. You definitely

had to do some research and come across some things that

were just coming out at the time.

Scott: And that’s one of the best ways to learn, I think, to be

honest with you. If you approach a topic that you know

something about and you’re like, “Okay, I need to really get

in-depth on this topic and really understand it,” I think one

of the best ways to really accomplish that is to think in

terms of, “Okay, if I was going to make a course about this,

how would I structure this and what would I say?” That was

really how I learned how to do machine learning in Spark, is

because I’ve got a request to put together a course about it

and I said, “Sure, yeah. Okay.”

Kirill: “I’ll learn it.”

Scott: And I was like, “Oh, crap. Now I have to learn Spark.” You

know, that wasn’t too much of a challenge because I already

knew Python. The thing is, if you know Python and Pandas

and scikit, you will know how to use Spark with Python. And

that’s one of the great things about Spark that I forgot to

mention earlier, is that if you know either Scala or Java or

Python or even R, then you can use those in Spark.

Kirill: That’s pretty cool, yeah. I’ve personally—I haven’t done it

through Spark but I’ve used R on Hadoop through PivotalR.

That was pretty good. And there’s another one – MADlib.

Have you heard of that one? I think it’s a Greenplum or a

Pivotal development just for Hadoop, a mathematical

package that you can apply as well.

Scott: Yes, I’ve heard of it. I know what it is, but I can’t say that

I’ve ever worked with it.

Kirill: Yeah, there’s just so many tools. You can’t work with

everything for sure.

Scott: Gosh, yeah. I mean, if you know how to code in R, then 90%

of that will translate over into using R inside Spark to do

machine learning.

Kirill: Yeah, gotcha. By the way, Tableau on top of Hadoop – how

do you guys do that?

Scott: It’s not really a whole lot different from using Tableau with

any other data source. It depends on what your engine is,

and I hate to use that term. For example, in a Cloudera

cluster, Impala would be the SQL engine that you would

want to use because Impala is just orders of magnitude

faster than Hive. But you would just point Tableau to the IP

address of one of your data nodes and point it to port 21050,

which is the port that Impala listens on. And then from there

it’s just like you do with any other ODBC/JDBC connection

that you set up for Tableau.
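
Before pointing Tableau at a node, it can help to confirm that the Impala port Scott mentions is reachable from the client machine. This is a minimal sketch using only the Python standard library; the function name is hypothetical and the host you pass in is whatever address your own cluster uses.

```python
import socket

IMPALA_CLIENT_PORT = 21050  # the port quoted in the conversation above


def port_is_open(host, port=IMPALA_CLIENT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

A `True` result only shows that something is listening on that port; the ODBC/JDBC driver setup in Tableau is still a separate step.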

Kirill: Okay, gotcha. So it’s possible, basically? Short answer is it’s

possible—

Scott: Yes. Short answer is, “Oh, yeah. It’s absolutely possible.”

Kirill: Okay, gotcha.

Scott: You have to understand, if you’re working with immense

datasets, that those aren’t going to run as fast as smaller

ones and you don’t want to package them into a workbook.

(Laughs)

Kirill: Gotcha, yeah. Yeah, for sure. (Laughs)

Scott: What we did for a client recently is, you know, we set up

Tableau Server to have a live connection into Impala because

it was a Cloudera cluster. And then Desktop just connects

into Server to get the data and that ran pretty quickly. Didn’t

have a whole lot of problems with it because the datasets

weren’t really that immense. But even if you’re dealing with

just tremendously huge datasets, there’s solutions for that

like AtScale, for example. AtScale is a nifty little tool. What it

does is it lets you create dimensional cubes on top of data in

Hadoop, but then it’s also got this adaptive cache, so it

learns what data you access the most and it will actually

cache that into memory. So it speeds things up

tremendously in terms of Tableau.
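
The idea of a cache that "learns what data you access the most" can be sketched in a few lines. This is not AtScale's implementation, which the conversation does not describe; the `FrequencyCache` class and its eviction rule are assumptions made purely to illustrate the concept.

```python
from collections import Counter


class FrequencyCache:
    """Keep the most frequently requested query results in memory."""

    def __init__(self, run_query, capacity=2):
        self.run_query = run_query  # slow function that hits the cluster
        self.capacity = capacity    # how many results fit in memory
        self.hits = Counter()       # how often each query was requested
        self.cached = {}            # query -> result held in memory

    def get(self, query):
        self.hits[query] += 1
        if query in self.cached:
            return self.cached[query]
        result = self.run_query(query)
        self._maybe_cache(query, result)
        return result

    def _maybe_cache(self, query, result):
        if len(self.cached) < self.capacity:
            self.cached[query] = result
            return
        # Evict the least requested cached entry if this query is now more popular.
        coldest = min(self.cached, key=lambda q: self.hits[q])
        if self.hits[query] > self.hits[coldest]:
            del self.cached[coldest]
            self.cached[query] = result
```

Queries that a dashboard repeats often end up served from memory, while one-off queries still go to the cluster each time.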

Kirill: Nice. That sounds like pretty solid foundation for Tableau.

Wrapping up this podcast, I’ve got another question for you,

like a visionary type of question or a philosophical one.

Obviously you’re deep into the space of Hadoop and data

and data science, machine learning. From where you’re

sitting and what you’re seeing, where do you think this

whole field of data science is going? And what should our

listeners look into to prepare for the future?

Scott: Let me narrow that down just a bit and pontificate on the

future of Hadoop, if you don’t mind.

Kirill: Sounds good.

Scott: I’m sure you’re familiar with the Gartner hype cycle.

Kirill: Yeah.

Scott: You know, we’ve gone through that peak where everybody

was just all about Hadoop, you know, “We’ve got to have this

and we’ve got to put it to use.” Now we’re kind of in that

trough of disillusionment where it seems like every time I go

online, I’m reading an article somewhere about the death of

Hadoop. I don’t think Hadoop is dead at all. I mean, I think

that whole idea is actually kind of silly. But, yeah, we’re

definitely in the trough of disillusionment, and then

afterwards comes the plateau of productivity, when

everybody has calmed down and realized that it’s not the

cure for everything, but at the same time, the sky is not

falling either. And you know, we settle in and we get some

work done.

I think that’s where the future of Hadoop is, is that

companies are going to start realizing what Hadoop is

actually useful for. There is going to be enough people with

the skills to do things in Hadoop and we’re going to actually

start getting things—it’s not just going to be the Fortune

500s who are able to extract value out of Hadoop. Pretty

much anybody will be able to.

Kirill: Gotcha. That’s some good advice. People listening to this,

don’t get afraid that Hadoop is dying. It’s not dying.

Everything will be okay. Just plough along and it’s going to

be an exciting space to be in. In any case, all these skills

that you learn, they’re so transferrable to whatever. Even if

something does replace Hadoop, it’s not going to be that

much different. You will be able to transfer all these skills

anyway.

Scott: And that’s another thing, too. It seems like every time I turn

around, somebody is comparing Hadoop to Spark, which I

think is kind of funny because every time I’ve ever seen

Spark, it was running on top of Hadoop. While it is possible

to set up a cluster that runs just Spark, the reality is that

hardly anybody does it that way.

Kirill: Okay. Thanks for the insights. Thank you very much, Scott,

for coming on the show. It was very exciting, sharing all that

knowledge. From the surveys that we run and from what I know

about our audience, we actually have quite a few executives

listening to this podcast and quite a few managers and

business owners even. If any of them ever want to get in

touch with you to invite you to do a public speech maybe at

their organization or help install Hadoop or do some

consulting work, where is the best place they can find you?

Scott: Probably e-mail, I would say. It’s [email protected].

Kirill: Gotcha. All right, we will definitely include that in the show

notes as well. People listening out there, if you need to

contact Scott, you have his e-mail now. And one last

question I have for you is, what is one favourite book that

you can recommend to our listeners to help them become

better at what they do?

Scott: I'm going to up that and say two.

Kirill: Sounds good.

Scott: You know, probably 80% of any job that you do with data is

getting the data into the usable format, something that you

can work with. O'Reilly has a great book for that called “Data

Wrangling with Python.” That book probably has 80% or

90% of the stuff that you’re going to do on a day-to-day basis

with data. It’s all about cleaning data, scraping data off the

web. I mean, if there is some technique or method for

preparing data, it’s probably in that book.

And the second one is Hadoop-specific and that’s “Hadoop

MapReduce v2 Cookbook” which is a really unnecessarily

long title for a book. That one is by Packt Publishing. I was

fortunate enough to be the technical reviewer on that book.

Again, it’s just full of day-to-day stuff that you would do with

Hadoop from spinning up a cluster in the cloud to bringing

in data and setting up a table with Hive. And it even goes

into using Mahout, which is another component of Hadoop,

to do machine learning. So if there is something that you

want to do with Hadoop, it’s probably in there.

Kirill: Sounds good. Thanks a lot. So there we go: “Data Wrangling

with Python” and “Hadoop MapReduce v2 Cookbook”. Once

again, thank you very much, Scott. I really appreciate you

taking the time to come on this podcast and share all of this

wisdom and knowledge with us.

Scott: Well, I really appreciate you inviting me to.

Kirill: So there you have it. I hope you enjoyed today’s podcast.

Lots and lots of valuable information. As you could see,

Scott has very in-depth knowledge on the topic and is

definitely in the most advanced frontiers of Hadoop and

everything to do with data in organizations, data lakes, and

big data and all of these things. He was kind enough to come

on the show, spend an hour of his time and share all of

these things with us.

So I hope you took this opportunity to pick up some

additional knowledge and skills. Personally, I definitely

learned a lot as well. My favourite part, of course, was the

breakdown of all of these different aspects of Hadoop, all of

these elements such as Hive, Kudu, Spark, Seahorse. Some

of them I knew about, some of them I’ve worked with, but

some of them I haven’t even heard of and I was very excited

to learn. I can definitely look back and say now I know some

more things about Hadoop, and it looks like this space is

constantly evolving. On that note, if you would ever like to

get in touch with Scott, if, for example, you are an executive

or a manager or you own your own business and you’d like

to get Scott to come in and help you with your analytics

strategy, make sure to reach out to him. As you could tell

from this podcast, this is a person with a huge wealth of

knowledge who’s excited to share it, so this is your go-to guy

for anything to do with Hadoop and data strategy.

And with that, don’t forget that you can find all the

resources mentioned in this episode at

www.superdatascience.com/51. There you will be able to

find the transcript for this episode as well as a link to Scott’s

LinkedIn, his company’s page, and his e-mail. Thank you so

much for being here today. I hope you enjoyed the episode. I

can’t wait to see you next time. Until then, happy analyzing.