SDS PODCAST EPISODE 51 WITH RANDAL SCOTT KING
TRANSCRIPT
Kirill: This is episode number 51 with Global Analytics Consultant
Scott King.
(background music plays)
Welcome to the SuperDataScience podcast. My name is Kirill
Eremenko, data science coach and lifestyle entrepreneur.
And each week we bring you inspiring people and ideas to
help you build your successful career in data science.
Thanks for being here today and now let’s make the complex
simple.
(background music plays)
Welcome to the SuperDataScience podcast everybody. Hope
your week is going great, and today we've got a very
interesting guest. Scott King is an analytics consultant, also
the founder of Brilliant Data, and also a renowned analytics
speaker. Scott comes into organisations to help them with
their data capabilities. He helps executives with their
strategy, helps them understand what are the best tools for
their organisation, and which is the best way to proceed
going forward into the future.
And in this podcast, we predominantly talked about Hadoop.
So if you've always wanted to find out a bit more about
Hadoop, or really what's going on at the cutting edge of
technology in this space, then this is the podcast for you. In
today's episode, you will learn a lot. And I mean a lot. Get
your pens and papers out, because today, you will learn
about HDFS, MapReduce, Hive, HAWQ, Pig, Kudu, Impala,
Spark, Seahorse, and what a data lake is, and much, much
more.
And my favourite part is that Scott has a natural ability to
explain these things in a very, very simple manner so you
don't have to be highly technical, you don't have to already
have a lot of knowledge about Hadoop to understand all of
these things. It'll be very easy for you to get up to speed. So
this is your crash course into Hadoop. By the end of this
hour you will know how to operate these terms, or at least
be up to date with what's going on in this space.
In addition, we will also talk about Scott's public speaking
and he'll give you some tips and tricks on how to prepare for
data science presentations, how to present to different types
of audiences, and the things that he does that help him out
when he's doing public speaking.
So all in all, very interesting, exciting podcast. Can't wait for
you to check it out. And without further ado, I bring to you
Scott King, Global Analytics Consultant and Founder of
Brilliant Data.
(background music plays)
Hello everybody and welcome to the SuperDataScience
podcast. Today we've got Randal Scott King calling in from
Atlanta. Randal Scott is a Global Analytics Consultant. How
are you going today, Scott?
Scott: I'm doing very well, how about yourself, Kirill?
Kirill: I'm well as well. And just for the benefit of our listeners, can
you explain the whole legacy of names in your family?
Because your name is Randal Scott, but everybody calls you
Scott. How did that come about?
Scott: I think that's just a thing in my family. We give people first
names and then don't use them. My father is John Richard.
He went by Richard, I think probably because his father was
also John Richard.
Kirill: Ok, so John Richard the junior, right?
Scott: Yeah, exactly, yeah.
Kirill: And you said you have 5 children, are you going to also give
them double names as well? That's 10 names you gotta
think of!
Scott: Well, yeah. So as a matter of fact, yeah. As a matter of fact,
my oldest daughter goes by both of her names.
Kirill: Ok. Alright. Gotcha. Ok, well thanks a lot for coming on the
podcast and so for those of you who haven't heard what
Scott does, Scott is a Global Analytics Consultant and he
consults companies like the Fortune 500 companies in
analytics, in spaces like Big Data, and Hadoop, just to name
a few. And so this is going to be a pretty exciting chat. So
Scott, tell us how you got into this whole space of analytics
consulting.
Scott: I started as a user. I was actually the Director of Business
Development at an IT reseller. And I was responsible for
making sure the company sold $100 million worth of Cisco
product.
Kirill: Wow.
Scott: And I realised one day, I was like, I found on the company's
internal website that somebody would publish the entire
dump from the Oracle ERP that had every transaction year
to date on it. And I realised pretty quickly just how powerful
that was. I spun up an instance of Pentaho. They have an
open source version, but it's a BI package. And I would go
out and get that database dump every day and put it on my
laptop and I started looking at the data and realized “Oh, my
gosh. I can find out everything about the business. I can find
out who’s selling what, where, and to whom.” It was really
useful in my job as business development to know all of that
about the business. We were able to post some really strong
gains year over year for the three years that I was there by
just knowing what was going on.
Kirill: Wow! That’s powerful.
Scott: So, BI is great. It tells you what’s going on with the company
right now and in the past, but it was really when I
discovered machine learning and I was like, “Oh, my God. I
can predict things now.”
Kirill: Yeah, totally. And then you slowly transitioned into the
space of Hadoop or you do both at the moment?
Scott: So once we started doing that, we started going back further
in time. We had something like 8 years’ worth of data.
Obviously, that outgrew my laptop really quickly. (Laughs)
So the light bulb just went off one day and I was like, “You
know what, if we’re getting this much value out of this kind
of stuff, we could be doing this for clients.” So yeah, I left
and went out and started Brilliant Data and started doing
that sort of thing for clients.
Kirill: Okay. That’s pretty cool. Are you consulting on your own or
do you have a team of people that you’re working with?
Scott: I’ve got a small team. We’re a small boutique firm but we’re
getting to be pretty well-known for getting results.
Kirill: Okay. That’s great. I’m actually looking at your LinkedIn
now and there’s so many people leaving comments about
how you presented within companies and on Hadoop and
really shaped other companies’ ways of thinking, so CEOs
and directors are leaving great comments. Can you tell us a
bit more about that? Apart from implementation, you’re
obviously also pushing the envelope in terms of an analytics
agenda and building this analytics culture in different
organizations. How do you go about doing that?
Scott: Well, you know, I’ve always done public speaking in some
form or fashion throughout my career. I’ve been in IT
consulting for about 20 years and have always either taught
classes or given sales pitches as a sales engineer, any
number of things that have to do with speaking. And I
realized, when I started Brilliant, I was like, “You know, this
is going to be a valuable tool for bringing business in,
getting out there and speaking and kind of starting the
conversations around things.”
These days I see it more as kind of educating people on the
state of analytics and the state of big data because it really
is changing so quickly all the time. In the past, you really
wouldn’t want to ever use Hadoop as a data warehouse
because there were some significant limitations there, but
there’s been some things that have come along just recently
that have kind of changed that now and it’s at least possible,
if not advisable, to do now. A lot of people don’t know that.
You know, you kind of get out there and let people know.
Kirill: Okay, gotcha. So basically, two areas that I want to dig into
further is, of course, your public speaking and your
experience there and also Hadoop and what you can tell us a
bit more about that. Maybe let’s start with Hadoop. In five
sentences or less, what is Hadoop for a person listening to
this podcast who hasn’t encountered Hadoop or just heard it
as a buzz term?
Scott: Well, there’s really two reasons why you would want to use
Hadoop. You’ve just got more data than a traditional
relational database can work with, or you’re wanting to work
with types of data that a relational database can’t work with.
Kirill: Interesting. And how much data are we talking that a
relational database can’t handle? A gigabyte, 10,000
gigabytes?
Scott: No, certainly not in gigs. You’d have to get into the upper
terabytes or even petabytes before you’d really max out most
of the really good relational databases like your Oracles and
whatnot.
Kirill: Gotcha. Does only the amount of gigabytes matter or also
the — you know, people probably have heard of the three Vs
of Hadoop – velocity, variety and veracity, I think, or
variability, maybe 4 Vs. Does it have to be just the volume of
data, or can it also be the different types of data that you
have in your dataset?
Scott: Well, yeah. It’s really those two things. And you’re right,
there are the three Vs, but the two that I think of when I’m
talking to clients and figuring out if they actually need
Hadoop – because a lot of people think they need Hadoop
and they don’t. I told somebody recently — they had SAP,
you know, they were wanting to build dashboards and I’m
like, “You do realize you can do that with what you have,
right?” But the two things that I think of when a client says,
“We need Hadoop,” is “Okay. Do you have the volume of data
necessary for Hadoop?” “Well, no.” “Okay. Do you have a lot
of different kinds of data that you want to work with that
you would find difficult to put into a database?” And they’re all
like, “Yes. You see, we’ve got these customer service
transcripts that we want to go through and do text analysis
on them,” etc. You know, you’re not going to do that in a
traditional relational database.
Kirill: Yeah, gotcha. Understood. All right, so Hadoop if you have a
lot of data, if you have lots of different types of data. So, how
does this happen? Somebody calls you up and says, “We
want Hadoop.” You ask them the questions and then you
say, “Either you need Hadoop or you don’t.” But if they do
need Hadoop, what happens from there?
Scott: Well, it’s a matter of determining what they have and what
they want to bring into Hadoop and what is the ultimate
business benefit that they want to achieve. I think that’s
really where a lot of Hadoop—there’s all kinds of press these
days about the demise of Hadoop. I think it was Mark Twain
who wrote to his local newspaper, they had published an
obituary for him, and he said “The news of my death has
been greatly exaggerated.” I think the news of Hadoop’s
death has been greatly exaggerated. I think that goes back to
the Gartner hype cycle. You know, we’re in the trough of
disillusionment right now and I’m really looking forward to
the — what is it, the plateau of productivity?
Kirill: Something like that, yeah. So, basically you go in and you
implement the system. How long does it take? What are the
main challenges that you face when you’re implementing
Hadoop at an organization?
Scott: Well, to actually stand up a Hadoop cluster doesn’t take very
long at all. To bring in the client’s data and to do the
groundwork necessary for that to figure out—to sit down
with them and say, “Okay, here’s the problem or problems
that you’re trying to solve” or “Here’s the additional
capabilities that you want to develop,” that takes a while.
Bringing the data in takes a while. You know, people think
of Hadoop as that whole data lake concept of “Oh, just throw
it all in the data lake and we’ll figure it out later.”
Well, you can do it that way, sure, but I wouldn’t advise it.
But sitting down with the client and figuring out what it is
that they’re trying to accomplish, figuring out what data
they’re going to need for that and how to structure that
inside either HDFS or Hive or now, if it’s a Cloudera install,
Kudu, which is a great little tool, we can talk about it in
more detail later if you want. But figuring out what exactly it
is that they’re wanting to accomplish and how to set that up
in Hadoop is what takes the longest time. Usually, once the
data is in and it’s in the format that you want, doing
analysis doesn’t really take all that long.
Kirill: Yeah, gotcha. And there are tools like Pivotal, for instance.
It’s a pretty good tool to use on top of Hadoop to make things
easy. They have PivotalR even, to allow for that. Okay, so
you’ve mentioned a couple of words – HDFS, Hive, Kudu.
Could you tell us a bit more? Let’s start getting into the
technical side of things.
Scott: Sure. The central idea behind Hadoop is that it’s a
distributed system, right? So you take a big problem like
analysing terabytes or petabytes of data and you cut it up
into smaller problems and you spread that load out over the
individual servers that make up the cluster and then they do
their part of it and send it back to you and you reassemble
the whole thing on the other side. Hadoop is all about
distributed computing. And HDFS, which stands for Hadoop
Distributed File System, is exactly what it sounds like. It’s a
distributed file system that spans the whole Hadoop cluster
and sits on top of the file system on each node.
Kirill: Okay. So that’s basically the system that connects
everything together?
Scott: That’s one way of thinking of it, yeah. I mean, it’s a file
system just like any other, but it spans the whole cluster
and as a result it can be massive.
Kirill: Okay. And it’s one of the main advantages of Hadoop, is that
it’s scalable, right?
Scott: Yes. You want to make it bigger, you just add machines to
the cluster.
Kirill: Okay. And how does that compare to using a super
computer? Why would one use Hadoop over using a super
computer?
Scott: Super computers tend to be very, very costly whereas
Hadoop uses commodity hardware. I tell people Hadoop is
the exact opposite of virtualization. With virtualization you’re
taking one big physical server and you’re running a bunch of
virtual servers on it. Hadoop does the exact opposite. It
takes a bunch of physical servers and makes one big virtual
server out of it.
Kirill: Okay, gotcha. So what happens when one of those individual
servers or commodity hardware breaks down?
Scott: Nothing. That’s the beauty of it. So HDFS replicates data a
minimum of three times. There will be three nodes that your
data is on. So if one of those nodes takes a dive, let’s say the
power supply sparks and you have to pull it out and replace
the power supply or just replace the entire machine. Well,
when you put it back in, HDFS will actually rebuild the file
system that was on it.
Kirill: Okay. That’s pretty cool. So basically you have three copies
of your data on separate nodes, but are they all being
processed simultaneously, or data is still processed on each
individual node but that is just kept as a backup?
Scott: Well, what HDFS does is let’s say that you have a terabyte
worth of data, it’s some kind of big table or something. What
HDFS is going to do is it’s going to take that and divide it up
into blocks. The size of the block is configurable, but I think
the default these days is 128 MB. Sometimes you might
want that to be 64 MB, sometimes you might want it to be
256. Like I said, it’s a configurable parameter. HDFS is going
to take that big terabyte file and it’s going to chop it up into
128 MB blocks and it’s going to replicate each block three
times.
So if you have a 20-node cluster, for example, that first
block might be on servers 1, 3 and 5. The second block
might be on servers 2, 15 and 10, etc. So the whole data gets
spread out over the cluster and then the name node, which
is the part of the cluster that decides where things go and
decides who does what, it will pick one of the nodes that that
first block of data is on and say, “Okay, your job is to
calculate this block of data for this job.” Does that make
sense?
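The block arithmetic Scott walks through can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `split_into_blocks` and `place_replicas` are hypothetical helper names, a terabyte is taken as 1024^4 bytes, and replicas are placed at random here, whereas a real HDFS name node applies its own placement policy.

```python
import random

MB = 1024 * 1024
TB = 1024 ** 4


def split_into_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Return how many blocks are needed to hold the file (rounding up)."""
    return -(-file_size_bytes // block_size_bytes)


def place_replicas(num_blocks, num_nodes=20, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes.

    Random placement is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    nodes = range(1, num_nodes + 1)
    return {block: sorted(rng.sample(nodes, replication))
            for block in range(num_blocks)}


num_blocks = split_into_blocks(1 * TB)
placement = place_replicas(num_blocks)
total_replicas = sum(len(nodes) for nodes in placement.values())
print(num_blocks, total_replicas)
```

With a 128 MB block size, the one-terabyte file becomes 8,192 blocks, so 24,576 block replicas end up spread across the 20 nodes. Losing one node still leaves two copies of every block it held.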
Kirill: Yeah, gotcha. And what about MapReduce? Can you tell us
a bit more about MapReduce?
Scott: Well, I kind of just did. (Laughs)
Kirill: Okay. So that is MapReduce?
Scott: MapReduce is the computation engine, or the original
computation engine, for Hadoop. MapReduce is the part that
does the actual calculating of anything on the data in Hadoop.
Since Hadoop v2 and with the introduction of something
called YARN, Yet Another Resource Negotiator, there are
additional computation engines like Tez and like Spark that
can run on a Hadoop cluster now, you’re not just limited to
MapReduce. And that’s a good thing because MapReduce
has some significant limitations.
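To make the "cut it up, compute the pieces, reassemble" idea concrete, here is a toy word count in plain Python. It runs on a single machine and does not use Hadoop at all; the names `map_phase`, `shuffle`, and `reduce_phase` are illustrative labels for the stages, not Hadoop API calls, and each string in `chunks` stands in for one block of a larger file.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to get the final counts."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk could be mapped on a different node; here they run in sequence.
chunks = ["big data big cluster", "data lake data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

The point of the pattern is that `map_phase` needs to see only its own chunk, so the work can be handed to whichever node already holds that block.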
Kirill: Okay, gotcha. All right. Moving on to this whole animal
kingdom of Hive, Kudu, Pig, etc.—
Scott: Oh, my gosh. It grows every day, man.
Kirill: (Laughs) What can you tell us about that? First of all, why
are the names all animals?
Scott: That’s a really good question. I’m not sure, but I do know
that Hadoop itself, that name came from Doug Cutting’s
young son at the time, and this was 10 years ago so that kid
could probably be an adult by now. But at the time, he had
this little toy stuffed elephant, this yellow stuffed elephant
that was named Hadoop and Doug decided to name that
project after that little toy.
Kirill: It’s like kids – inventors or developers on these things and
their kids. Same thing with Pivotal. I think they have a
product called Plum, or it’s a name of the company—
Scott: Greenplum.
Kirill: Greenplum! So, to his child at the time, he was like creating
and said, “Oh, I have a cool idea. What should I call it?” And
the child picks up an apple from the basket of food that they
have at the table and the guy is like, “Well, I can’t call it
Apple. Apple is already taken. What’s the closest thing to a
green apple? A green plum.” So that’s why they call it
Greenplum. It’s just funny. Anyway, Hive – what is Hive?
Tell us about that.
Scott: Let’s say you bring in a bunch of CSV files into HDFS and
you want to be able to run SQL queries on those files. Well,
that’s what Hive does. It creates tables on top of these files
to make them searchable by SQL queries.
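As a loose analogy for "SQL on top of CSV files", the sketch below uses Python's built-in `sqlite3` and `csv` modules. It is a stand-in, not Hive: SQLite copies the rows into its own table, whereas Hive, as described above, defines a table over files that stay in HDFS. The sample data, the `sales` table, and its column names are all made up for illustration.

```python
import csv
import io
import sqlite3

# A tiny CSV "file" held in memory; in the scenario above it would sit in HDFS.
csv_text = "rep,region,amount\nalice,east,100\nbob,west,250\nalice,west,50\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (:rep, :region, :amount)", rows)

# Once a table describes the file's columns, ordinary SQL works on it.
totals = conn.execute(
    "SELECT rep, SUM(amount) FROM sales GROUP BY rep ORDER BY rep"
).fetchall()
print(totals)
conn.close()
```

The query is the same kind of "who's selling what" question Scott mentions asking of the ERP dump earlier in the conversation.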
Kirill: Okay, gotcha. So, it kind of makes unstructured data
workable through this traditional structured approach.
Scott: It’s kind of an interpreter, really. It’s an abstraction that sits
on top of MapReduce and it translates SQL code into
MapReduce Java and sends that to MapReduce and says
“Hey, execute!”
Kirill: Yeah. Pivotal have a similar thing to that. You can do SQL
on Hadoop through a Pivotal product as well, I just don’t
remember what it’s called.
Scott: I want to say that’s HAWQ.
Kirill: Exactly, that’s right. Another bird, another animal. (Laughs)
Scott: Another animal. (Laughs)
Kirill: I’m not surprised. All right, cool. That’s good. So basically
even somebody not knowing how to work with unstructured
data and do Java code and all those other things, they can
still work with Hadoop through things like HAWQ or Hive.
Scott: Yes. And there’s actually more abstractions on top of
MapReduce. Like Pig you mentioned earlier, it’s a scripting
language and it’s much easier to learn than Java. Again,
basically all it does is it takes that scripting language, turns it
into Java and sends it to MapReduce and says “Execute.”
Kirill: Okay, good. That’s good. So we’ve covered HDFS,
MapReduce – this is value – Hive, HAWQ, PivotalR, Pig.
Okay, next one on the line – Kudu. You said that’s
Cloudera’s creation. Tell us a bit more about Kudu.
Scott: Yeah, a lot of people have tried to implement Hadoop in a
way that you would normally only want to do with a
relational database. I mean, Hadoop and relational
databases were never meant to do the same things. But a lot
of people have tried, for example, to deploy Hadoop as a data
warehouse. This is something that I’ve advised people
against in the past because the things that you have to do to
make that work are just [indecipherable 23:11], for a lack of
a better term. The problem is that data in HDFS is
immutable. You don’t change data in HDFS. You either
delete it and overwrite it or you append.
Obviously, that creates some problems as far as trying to
use it as a data warehouse because data warehouses update
information all the time. So what Kudu really does – and I
don’t understand why Cloudera is not marketing it this way,
but they’re not – but what Kudu does is it basically
eliminates the immutability issue. Data in Kudu is
absolutely updatable. You can do updates, you can do
inserts, you can do deletes just like you would on a
relational database.
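The difference between the two storage styles can be sketched with two small Python classes. Neither is real HDFS or Kudu code; `AppendOnlyFile` and `MutableTable` are hypothetical names that only mimic the behaviour described here: one store allows append or wholesale overwrite, the other allows row-level inserts, updates, and deletes.

```python
class AppendOnlyFile:
    """Mimics the write-once style: append records or replace the whole file."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def overwrite(self, records):
        self._records = list(records)

    def read(self):
        return list(self._records)


class MutableTable:
    """Mimics an updatable store: rows are addressed by key and changed in place."""

    def __init__(self):
        self._rows = {}

    def upsert(self, key, row):
        self._rows[key] = row

    def delete(self, key):
        self._rows.pop(key, None)

    def read(self):
        return dict(self._rows)


# Changing one customer's city in the append-only store means rewriting everything.
f = AppendOnlyFile()
f.append({"id": 1, "city": "Atlanta"})
f.append({"id": 2, "city": "Boston"})
f.overwrite([dict(r, city="Austin") if r["id"] == 1 else r for r in f.read()])

# In the mutable store the same change touches a single row.
t = MutableTable()
t.upsert(1, {"city": "Atlanta"})
t.upsert(2, {"city": "Boston"})
t.upsert(1, {"city": "Austin"})
t.delete(2)
print(f.read(), t.read())
```

A data warehouse that updates rows constantly would pay the full-rewrite cost on every change in the first model, which is the problem Scott is pointing at.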
Kirill: Okay. That makes sense. So how does Kudu do it in terms of
— if HDFS restricts the updating of data, how does Kudu go
around that?
Scott: Well basically it does it by not using HDFS. So, Kudu is
actually a storage engine, you know, they’re very hesitant to
call it a database because really it’s not, but it is a storage
engine that is updatable whereas HDFS isn’t. So you can
have HDFS and Kudu running side by side on a Hadoop
cluster and use them for different things, but Kudu does not
rely on HDFS at all.
Kirill: Okay.
Scott: Hive relies on HDFS. Kudu just has its own thing for that.
Kirill: Okay. How about next in our line-up of animals and
creatures – Spark and Seahorse? What can you say about
Spark and Seahorse?
Scott: Well, going back to what I said earlier about MapReduce,
you know, it was the original computation engine. But in
Hadoop v2 they introduced YARN. I like to call YARN the
mom of the cluster. You know, when you were a kid and you
wanted something or you wanted to do something you went
to mom, right? Because if you went to dad, he would just
say, “Well, what does your mother say?”
Kirill: (Laughs) Totally, yeah.
Scott: But, yeah, I mean, if you and your brother or you and your
sister wanted a cookie and there was only one cookie, mom
was going to be the one who decided who got the cookie,
right? So YARN does the same thing in a Hadoop cluster.
YARN is administering the resources of the cluster and
deciding, if you’ve got a bunch of different jobs running on a
Hadoop cluster, YARN is going to say, “Okay, I’m giving this
job this much processing power, etc.” That used to be in
MapReduce. So, in v1, MapReduce did all that. But they
took that out of MapReduce and put it into YARN so now
MapReduce just does computation. And what that did is it
opened up the window to other computation engines and
Spark is one of them.
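The "who gets the cookie" role can be sketched as a tiny allocator in Python. This is a deliberately naive first-come, first-served model, assumed purely for illustration; real YARN schedulers are considerably more sophisticated, and the `allocate` function, the job names, and the memory figures are all hypothetical.

```python
def allocate(total_memory_gb, requests):
    """Grant memory to jobs in arrival order until the cluster runs out.

    `requests` is a list of (job_name, wanted_gb) pairs.
    Returns the grants per job and whatever memory is left over.
    """
    remaining = total_memory_gb
    granted = {}
    for job, wanted in requests:
        give = min(wanted, remaining)
        granted[job] = give
        remaining -= give
    return granted, remaining


# Jobs from different engines compete for the same cluster resources.
requests = [("etl_mapreduce", 40), ("ml_spark", 50), ("adhoc_hive", 30)]
granted, remaining = allocate(100, requests)
print(granted, remaining)
```

Because the arbiter sits outside any one engine, the MapReduce job and the Spark job in this sketch draw from the same pool, which is the separation Scott describes.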
Kirill: So Spark goes instead of MapReduce?
Scott: Yeah. And the great thing about it is, because of the way
YARN works, you don’t have to choose either/or. So, for
example, if you’ve got a job that could really benefit from
MapReduce’s batch-oriented nature, which a lot of ETL
processes would qualify, then you can have that job running
in MapReduce while over here you’ve got a different job
running in Spark and taking advantage of Spark’s greater
speed.
Kirill: Gotcha. So advantage of Spark over MapReduce is speed?
Scott: The advantage of Spark is really it’s orders of magnitude
faster than MapReduce. The reason for that is MapReduce is
very disk-oriented and very batch-oriented. It’s just slow.
When MapReduce was invented, they weren’t really all that
concerned with speed. They were just concerned with being
able to do the kinds of things that MapReduce does at all.
So, whereas MapReduce is batch-oriented and disk-oriented
– I mean, it does everything on disk – Spark will actually
load data into memory. And again, it’s just orders of
magnitude faster because of that.
Kirill: Gotcha. All right. And what can you say about Seahorse?
Scott: You know, Seahorse is this brand-new thing that I just
stumbled across maybe a month and a half ago. What
Seahorse does is it actually gives you a graphical front end
to Spark. So Spark has these great machine learning
libraries – ML and MLlib – and Seahorse gives you a
graphical interface to using those. I’m noticing it really
speeds up prototyping for me. I loaded it on my laptop and
interfaced with the cluster here at the office and I’ve noticed
that rather than hand coding everything in Python or Java
or whatever, having that graphical front end enables me to
prototype things so much quicker.
Kirill: Really, they are just punching these out, like one and a half
months ago, and you’re obviously in the centre of everything
that’s going on with Hadoop, how quickly do they release
these things?
Scott: You were saying something about the ecosystem around
Hadoop earlier and how they’re all named after different
animals and all this and I was like, “Yeah, there’s a new one
every day.” You know, again, you could spend all kinds of
time just keeping up with the latest on either Hadoop or big
data or machine learning or data science. I mean, there’s
always something new every week.
Kirill: And how do you do it? How do you keep up with everything?
Scott: Who says I do? (Laughs)
Kirill: Well, you seem to be pretty up-to-date with everything we’ve
discussed so far. You definitely know what’s going on.
Scott: I think keeping current on anything in tech is really a matter
of community. I introduced it to a friend of mine who’s a data
scientist at a benefits company. I sent him an e-mail and
said, “Hey, have you seen this thing?” I was talking about
Seahorse. “Seahorse is basically a GUI for Spark.” He’s like,
“Whoa, I love this!” And there’s always somebody who says,
“Hey, Scott, have you seen this thing over here?” Sometimes
it’s, “Yeah, I have. I think it’s great.” And sometimes it’s,
“Oh, my God, no! What is that?”
Kirill: It’s a seahorse. (Laughs) Okay. That’s pretty cool.
Summarizing all of these things up together, can you give us
a quick description, what do people mean when they say
data lake? Obviously these tools somehow all together assist
or facilitate data lakes or are used in data lakes. What do
people mean by the term data lake?
Scott: I think different people mean different things by it. I think a
lot of people use the term and don’t even really know what
they mean by it.
Kirill: All right. Let’s make it clear for everybody so that when
people use it in the future, they know what they’re talking
about.
Scott: Yeah. So, when you think about a lake — a lake has layers,
right? You’ve got the sediment and you’ve got kind of the
murky depths and then you’ve got the top. And really, what
it’s all about in terms of making that useful for an
organization is you’ve got different kinds of data that you
can put in there. You know, earlier we said that thing about
some people talk about how Hadoop is where you can just
throw things and deal with it later. You really can. You can
use Hadoop to archive data and just get it in there and
figure out what to do with it a couple of years down the road
when there’s some use for it. Or you can have it in a Hive
table or a Kudu table for ready access.
I think really the whole data lake concept came about from
having a place where you could put everything, which was
kind of the goal of the data warehouse, but data warehouses
have very strict schema and whatever data you’re going to
put in there has to conform to that schema. With Hadoop
you don’t really have that restriction. You can put anything
in there. And I think that’s really the whole point of the data
lake concept. Yeah, there’s all kinds of stuff that you can put
in there and you can either start using it immediately or
figure out what to do with it later.
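(Editor's sketch: the contrast Scott draws between a warehouse's strict schema and "put anything in there" is often called schema-on-read. The Python below is a minimal illustration of that idea only, not how Hadoop or Hive implements it. The sample records, the `read_with_schema` helper and the `purchase_schema` mapping are all hypothetical names made up for this example.)

```python
import json

# Raw events as they might land in the lake: no shared schema is enforced on write.
raw_lines = [
    '{"user": "alice", "amount": "19.99", "currency": "USD"}',
    '{"user": "bob", "amount": 5}',
    '{"user": "carol", "note": "no amount recorded"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick the wanted fields and cast them.

    `schema` maps field name -> (cast function, default value).
    Fields missing from a record fall back to the default.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            row[field] = cast(record[field]) if field in record else default
        rows.append(row)
    return rows


# The schema is decided when the data is used, not when it is stored.
purchase_schema = {"user": (str, ""), "amount": (float, 0.0)}
purchases = read_with_schema(raw_lines, purchase_schema)
total = sum(row["amount"] for row in purchases)
print(purchases)
print(total)
```

The three records disagree about which fields exist and how `amount` is typed, yet all of them can be stored as they are. A different question later on could read the same raw lines with a different schema.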
Kirill: All right. But how do I imagine it because it’s so different to
what people are used to, whether it’s SQL or Excel or just
folders and files. Can I imagine a data lake as just a huge
folder where I can just create new folders and just dump all
of my videos or all of my texts, scanned documents,
whatever I want? Does that analogy make sense? Do you
think of it as folders containing certain information? Or does
it look like something else to you?
Scott: No, that’s a very apt description of it. I mean, it’s a file
system much like any other, HDFS is. And you can put
almost anything onto it. Like, you mentioned video. You can
put images, you can put audio, you can put texts. Really,
whatever it is that you want to stick on there, you can stick
on there.
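(Editor's sketch: to make the "folders on a file system" picture concrete, the Python below builds the `hdfs dfs -mkdir -p` and `hdfs dfs -put` commands that would file a few local files into one folder per file type. It only builds and prints the commands, so it runs without a cluster. The `/data/lake/raw` root, the file names and the `hdfs_upload_commands` helper are assumptions for illustration, not anything described in the conversation.)

```python
import shlex
from pathlib import PurePosixPath


def hdfs_upload_commands(local_files, lake_root="/data/lake/raw"):
    """Build `hdfs dfs` commands that file each upload under a folder per type.

    Returns a list of argument lists; nothing is executed here.
    """
    commands = []
    seen_dirs = set()
    for local in local_files:
        # Use the file extension as the folder name, e.g. "mp4" or "txt".
        suffix = PurePosixPath(local).suffix.lstrip(".").lower() or "other"
        target_dir = f"{lake_root}/{suffix}"
        if target_dir not in seen_dirs:
            commands.append(["hdfs", "dfs", "-mkdir", "-p", target_dir])
            seen_dirs.add(target_dir)
        commands.append(["hdfs", "dfs", "-put", local, target_dir + "/"])
    return commands


files = ["talk.mp4", "scan_001.png", "notes.txt", "intro.mp4"]
for cmd in hdfs_upload_commands(files):
    print(shlex.join(cmd))
```

On a machine with the Hadoop client installed, each argument list could be passed to `subprocess.run` to perform the upload for real.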
Kirill: Okay, gotcha. Thanks for the confirmation. It makes things
a bit more clear now. And with your public speaking, so you
have all this vast knowledge on in-depth topics and you
know obviously how to set these things up and communicate
them well. When you go into a company and you do the
speaking there to the employees and even executives, what
is your main goal? What is your main goal that you’re trying
to communicate to them and what do you want as an
outcome for them at the end of your conversation?
Scott: Well, I mean, it’s really different. If I’m doing public speaking
as a public speaking thing, if they’re bringing me in to do a
30 minute, 60 minute, 90 minute keynote, then it’s a matter
of I’m trying to find out what it is that they want to
accomplish. For example, there was this benefits company
here in the States. They have like 300 Java developers and
they had done a proof of concept with Hadoop and they were
going to move forward with using Hadoop and they wanted
me to come in and address these 300 Java developers at
their annual conference that they have internally within the
company.
And I said, “Okay, what is it really you’re trying to
accomplish with this keynote?” And they said, “Well, we
want to educate them on Hadoop, give them the 30,000 foot
view of how it works and what the pieces, parts are, but
mostly we want them to not be afraid of Hadoop and
understand that their jobs are changing a bit, but they won’t
be changing that much and they won’t go away.” And so
that’s exactly how I tailored the speech, was “Hey, guys.
Here’s what Hadoop is. Here’s why the company is going to
Hadoop. Here’s why your future as Java developers is very
secure with Hadoop: a) MapReduce is Java.” If you’re
programming in MapReduce, you’re not using Python or
Scala or whatever. That’s what they were trying to
accomplish. They wanted to kind of reassure these guys and
educate them at the same time. That’s what I tailored the
speech to do.
Kirill: Gotcha. Okay, it makes sense. Can you give us another
example of a different speech where you had to tailor to a
different type of audience?
Scott: Actually, yes. An IT organization asked me to come in and
address the executives in their own company and explain to
them why the IT organization wanted to use Hadoop to
augment the company’s existing data warehouse. You know,
I was talking to the senior executive who wanted me to do
this and I’m like, “Um, couldn’t you do this?” (Laughs)
Kirill: What was his answer to that?
Scott: He said, “Well, yes, but a) I’m not good at that sort of thing,
and b) I think it will carry more weight if it’s coming from a
third party.” (Laughs)
Kirill: Okay.
Scott: So, I was hired by the IT organization to convince executives
to go in a direction that the IT group wanted to go.
Kirill: Okay, definitely very different. I see how that’s working. So
what’s your biggest challenge that you face when speaking
to whether it’s executives or large audiences like that? Is it
hard to get the message across about Hadoop?
Scott: Not too hard, no. I’ve always had kind of a knack for taking
something that’s really complicated and breaking it down to
its essence and explaining that in a way that most people
can understand. To me, the most important thing is to make
sure that I understand what it is they’re trying to accomplish
with the speech – that’s really the biggest thing – and to
work with them and make sure that “Okay, what does
success look like at the end of this speech? What’s going to
happen? Who’s going to do what? What information is going
to get conveyed? Is there a skill that you want them to
develop? Is there an idea that you want to plant in their
head? What does success look like? How is life different after
this speech?”
Kirill: Okay, gotcha. Just on that, I wanted to ask you as well, for
people out there who are listening and who want to get into
speaking, or public speaking, on data-related topics,
whether it’s data science, machine learning or even Hadoop
like yourself, what would your biggest one piece of advice be
for them? Because obviously data science is a very kind of
topic where you don’t interact with people that much, where
you’re very technical and going into speaking can be a big
shift for somebody. What would your best advice for a
person like that be?
Scott: I would probably say speaking is a skill just like any other
skill, like riding a bike or like coding in Python. You learn it
by doing it. If you haven’t done much speaking before,
probably the best thing to do is find opportunities to speak. I
hate to use the old cliché about “Go join Toastmasters,” but
people recommend that because it works.
Kirill: Okay, gotcha. And yourself, how did you get started into
speaking?
Scott: Oh, gosh. We had a piano in our house when I was growing
up. I grew up on a peanut farm in southern Alabama, so
there wasn’t a whole lot to do. I mean, there was plenty to
do, but there wasn’t a whole lot to do that was very
entertaining. At 6 years old, I started planting myself in front
of the piano and tinkering around on that and eventually
taught myself how to play piano. I want to say I was 12
when I started playing the organ at the little church around
the corner that had 70 people in it. And so from a very
young age I was used to being in front of people and I don’t
think that I ever had the opportunity to develop stage fright.
Kirill: Okay, so it’s pretty lucky circumstances that you put
yourself into.
Scott: Right. And speaking in front of people isn’t much different
from playing a musical instrument in front of them, really.
Kirill: Okay, gotcha. When you’re speaking to lots of people – I’m
just curious about this for myself – who do you focus your
attention on? Do you look at one person, or do you move
your eyes around the room to make everybody feel included?
Scott: Oh, you’ve got to move around the room, sure. I mean, if you
look at one person the whole time you’re speaking, you’re
going to accomplish one of two things. You’re going to make
everybody else feel kind of disconnected from you or you’re
going to really freak that person out.
Kirill: (Laughs) Yeah, they’d be like, “Why is he staring at me all
the time? I’m tired of his gaze.”
Scott: Yeah. So I kind of let my eyes wander the room as I’m
talking and kind of gauge different people’s reactions to what
I’m saying. Of course, the thing that you never want to see is
that person with half-shut eyes looking like they’re about to
nod off. (Laughs)
Kirill: Yeah, gotcha. That’s kind of like the next thing I wanted to
ask you. How do you structure your presentations in such a
way that people don’t fall asleep? Because a very common
flaw of technical presentations is that they’re either
overpopulated with formulas or overpopulated with text or
just the way the presentation is flowing just makes people
nod off and they can’t pay attention for more than 8 or 12
minutes. What’s your trick to that?
Scott: Well you have to know the audience, right? There’s nothing
wrong with a speech that’s full of technical information if
you’re speaking to 300 Java developers. There is everything
wrong with a speech that’s full of technical information if
you’re talking to a room full of business executives.
Kirill: Okay, gotcha. You want to keep it nice and sweet and short
for the business executives.
Scott: Well, keep it high level for sure. You know, don’t get into the
weeds with those kinds of guys. And be sure to make sure
that whatever it is you’re talking about this technical subject
relates back to some sort of business problem or business
result or something that they can relate to.
Kirill: Okay, gotcha. These speaking events, how do they link up
with the work you do at Brilliant Data? Do they detract from
your time that you could be spending implementing
Hadoop? Or do they facilitate that and actually help you
bring on more work into the company or better deliver the
projects that you're delivering by then following up with
some speaking events?
Scott: Well, I’m a consultant who speaks, not a speaker who
consults. Whenever I go and give a speech somewhere, it’s
always coming out of an experience that I had with a client
somewhere. And sometimes those experiences are kind of
humorous. That’s always great really because humour is a
necessary ingredient for public speaking. We were talking
earlier about keeping people’s attention and humour is great
for that. It also makes you seem a lot more human when
you’re on stage if you relate a funny story.
Kirill: Yeah, and approachable. Can you give an example of a
funny story you’ve used recently?
Scott: So, I started my career really as a network engineer, not
even really doing data. I was working for Cisco at the time
and they sent me out to this switch site for a major service
provider. They were having all kinds of errors. And when I
got there, the facility manager seemed very amused that I
was there. And so I said, “Yeah, I’m here to troubleshoot the
problems with this equipment.” He said, “Oh, yeah. I know
where that is. We’ll go up to the top floor.” So we went up
there and he showed me the routers that were having the
trouble and I said, “Jimmy, what is this bright film on top of
the equipment?” He goes, “Oh, that’s where the roof leaks.” I
said, “Leaks?! As in this is still going on?” He says, “Oh,
yeah. I’ve been sending e-mails trying to get permission to
get the roof fixed for like 6 months now.”
Kirill: (Laughs) Well, there is your problem, right?
Scott: Well, yeah. And so, you know, they spent whatever they
spent. I’m sure it was not cheap to have Cisco send me out
there to troubleshoot this when they really could have
figured it out just by calling the facilities.
Kirill: Yeah, okay. Definitely. I see how that can make people a bit
more happy in your presentations and lighten the mood up.
That’s a pretty cool story. Moving on to some other questions
that I have for you, what is a recent win that you’ve had in
your career that you can share with us, something that
you’re most proud of?
Scott: You know, I would have to say it’s been about a year and a
half ago now. Packt Publishing reached out and said, “Hey,
you know, we really want to do a training course on Hadoop
and we think you’re the right guy to do that.” I said, “Okay,
yeah. That sounds like a lot of fun.” They give you this
compressed six week time period to get this thing done. And,
you know, I know what I want to say, I know how to do this
thing that I’m about to demonstrate, but getting it recorded
and edited and put together into something that looks good
and sounds good is not easy and it’s very time-consuming,
whereas I thought, “Oh, yeah, six weeks. This will be done in
three.” No, it wasn’t. I think it was done in six weeks and
three days. (Laughs) But I think the end result was
something that I’m pretty proud of.
Kirill: That’s nice. So you have a course which people can take. So
our listeners, if they’re interested they can find your course.
Where can they find your course?
Scott: Well, they can go to packtpub.com or it’s also on the O’Reilly
site, oreilly.com. It’s called “Learning Hadoop 2”. There is a
book by that title and there is a video course by that title
and I did the video course.
Kirill: Okay, interesting. And how did you find creating videos?
Was it very different to public speaking and was it very
different to other forms of education that you’ve done before?
Scott: Well, it’s actually quite different from public speaking
because if you make a mistake you get to go back and fix it.
Kirill: (Laughs) So it’s better?
Scott: Oh, yeah. You know, you kind of think about what it is you
want to demonstrate and what you want to say about it
ahead of time. You know, if you’re going along and you kind
of trip over your words or you say the wrong thing or you
stub your toe on the leg of the table and you accidentally
curse, like I did one time, you just go back and you take that
out. Nobody has to know.
Kirill: Totally. Okay, next question is what is your one biggest
challenge that you’ve ever faced in your career?
Scott: Well, I’ll tell you. Leaving where I was, which was a very
comfortable position, I had a great boss, I had a great team
that I was working with. But leaving that and going out and
starting your own thing is very difficult. They say the three
hardest addictions to overcome are heroin, carbohydrates
and a monthly salary. Now, I don’t have any experience with
the first one, but I have lost 20 pounds before and I have left
a monthly salary and a very comfortable job to start a
company before. And those two are very hard things to go
without.
Kirill: Okay, gotcha. And was it tough the first few months or even
years when you left?
Scott: Oh, gosh, yeah. The first year was—let’s just say it was very
educational. (Laughs)
Kirill: Yeah, totally. I find it’s like nothing else. You learn so much
while you’re building a company, building a business,
building a team. How’d you go about building a team, by the
way? How long did it take you to hire your first employee?
Scott: Well, it took a while before I needed an employee. As far as
like starting the company, there’s this guy, Alan Weiss,
that’s done all these books on consulting and running a
consulting practice and all that. Something that he says is,
“When you’re 80% ready, go ahead and pull the trigger and
figure out the other 20% as you go.” I thought I was 80%
ready and then I realized after a year of looking back on that
year, I thought, “Oh, you know what? I was really only about
50%.”
Kirill: (Laughs) Yeah. But you pulled it off. That’s good.
Scott: Yeah, it took a while. My first employee was actually a
virtual assistant, really is what she was. She’s here in town,
but we met in a Starbucks around the corner from my house
here. You know, she started up a conversation and we’re
talking, and she asked what I was doing and I said, “You
know, I’m looking through these websites and I’m going to
hire a virtual assistant.” And she said, “Actually, I’ve been
out of work for about two years now. I had a baby and
stayed home and all that. I’m looking to get back into
working again and I’d be very interested.” So that’s kind of
how that whole thing came about. I told her, “Okay. Here’s
what the job would involve and here’s what the hours would
be and all that,” and it just kind of went from there.
Kirill: And how big is your team now?
Scott: I’ve got two salespeople, there’s myself, there’s a team of
subcontractors that I’ll use for various things, for example,
for user interface, design and anything graphical. There’s a
guy that I’ve known for 10 years and he’s probably one of the
top 10 in the country at that sort of thing.
So when I’ve got a project that’s big enough size that I need
to bring people in, I know who to call for that. And we’re
about to bring on a couple of interns, actually. I’m extremely
fortunate to be headquartered in Atlanta. Georgia Tech is
here and they have an extraordinary program for data
science and for machine learning. Some of the things they
do there are just incredible. We’re bringing on a couple of
interns who know data science and want to learn Hadoop.
Kirill: Okay, good. It’s like a win-win.
Scott: Absolutely. You know, if you’re going to do an internship, it’s
got to be win-win. There’s got to be something in it for them.
Kirill: Okay, cool. Next question is what is your one most favourite
thing about being in the space of data?
Scott: I’d have to say it’s just the diversity of things that you can
do, right? I mean, every company has data. It really doesn’t
matter what their business is, it generates data. I’ve been
fortunate enough to work with Fortune 500 companies and
mid-market companies and right now one of my clients is a
manufacturing company that makes plastic and metal
containers. Every engagement you do as a consultant in this
industry, you learn something, you know. Sometimes I learn
as much as the client I’m working with. Of course, I don’t tell
them that because then they’ll try to—
Kirill: (Laughs) That’s the best. Oh, my God, yeah. I totally
understand. I can completely relate to that because the
space is growing so quickly and the amount of new
technology and methodology that’s coming out all the time is
immense. It’s just impossible to know everything. Inevitably,
you’re going to be learning. Like, when I create a course, I
know some things. But some things I learn as I’m doing the
research for the content I’m creating. I’m sure you were the
same when you were creating your course. You definitely
had to do some research and come across some things that
were just coming out at the time.
Scott: And that’s one of the best ways to learn, I think, to be
honest with you. If you approach a topic that you know
something about and you’re like, “Okay, I need to really get
in-depth on this topic and really understand it,” I think one
of the best ways to really accomplish that is to think in
terms of, “Okay, if I was going to make a course about this,
how would I structure this and what would I say?” That was
really how I learned how to do machine learning in Spark, is
because I’ve got a request to put together a course about it
and I said, “Sure, yeah. Okay.”
Kirill: “I’ll learn it.”
Scott: And I was like, “Oh, crap. Now I have to learn Spark.” You
know, that wasn’t too much of a challenge because I already
knew Python. The thing is, if you know Python and Pandas
and scikit, you will know how to use Spark with Python. And
that’s one of the great things about Spark that I forgot to
mention earlier, is that if you know either Scala or Java or
Python or even R, then you can use those in Spark.
Kirill: That’s pretty cool, yeah. I’ve personally—I haven’t done it
through Spark but I’ve used R on Hadoop through PivotalR.
That was pretty good. And there’s another one – MADlib.
Have you heard of that one? I think it’s a Greenplum or a
Pivotal development just for Hadoop, a mathematical
package that you can apply as well.
Scott: Yes, I’ve heard of it. I know what it is, but I can’t say that
I’ve ever worked with it.
Kirill: Yeah, there’s just so many tools. You can’t work with
everything for sure.
Scott: Gosh, yeah. I mean, if you know how to code in R, then 90%
of that will translate over into using R inside Spark to do
machine learning.
Kirill: Yeah, gotcha. By the way, Tableau on top of Hadoop – how
do you guys do that?
Scott: It’s not really a whole lot different from using Tableau with
any other data source. It depends on what your engine is,
and I hate to use that term. For example, in a Cloudera
cluster, Impala would be the SQL engine that you would
want to use because Impala is just orders of magnitude
faster than Hive. But you would just point Tableau to the IP
address of one of your data nodes and point it to port 21050,
which is the port that Impala listens on. And then from there
it’s just like you do with any other ODBC/JDBC connection
that you set up for Tableau.
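(Editor's sketch: before entering a host and port into a Tableau connection dialog, it can help to confirm that the port is reachable at all. The Python below is a small pre-flight check using only the standard library. Port 21050 comes from the conversation. The address `10.0.0.12` and the helper names are hypothetical, and a successful TCP connection says nothing about whether the ODBC/JDBC driver or authentication is set up correctly.)

```python
import socket

IMPALA_PORT = 21050  # port named in the conversation for ODBC/JDBC clients


def endpoint_reachable(host, port=IMPALA_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def describe_endpoint(host, port=IMPALA_PORT):
    """Format the host/port pair as it would be typed into a connection dialog."""
    return f"{host}:{port}"


# Hypothetical data node address; replace with one from your own cluster.
print(describe_endpoint("10.0.0.12"))
```

Calling `endpoint_reachable("10.0.0.12")` from the machine that runs Tableau would then show whether a firewall or a wrong address is in the way before any driver configuration begins.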
Kirill: Okay, gotcha. So it’s possible, basically? Short answer is it’s
possible—
Scott: Yes. Short answer is, “Oh, yeah. It’s absolutely possible.”
Kirill: Okay, gotcha.
Scott: You have to understand, if you’re working with immense
datasets, that those aren’t going to run as fast as smaller
ones and you don’t want to package them into a workbook.
(Laughs)
Kirill: Gotcha, yeah. Yeah, for sure. (Laughs)
Scott: What we did for a client recently is, you know, we set up
Tableau Server to have a live connection into Impala because
it was a Cloudera cluster. And then Desktop just connects
into Server to get the data and that ran pretty quickly. Didn’t
have a whole lot of problems with it because the datasets
weren’t really that immense. But even if you’re dealing with
just tremendously huge datasets, there’s solutions for that
like AtScale, for example. AtScale is a nifty little tool. What it
does is it lets you create dimensional cubes on top of data in
Hadoop, but then it’s also got this adaptive cache, so it
learns what data you access the most and it will actually
cache that into memory. So it speeds things up
tremendously in terms of Tableau.
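(Editor's sketch: the idea of a cache that "learns what data you access the most" can be illustrated with a simple frequency-based cache. The Python below is a toy under that assumption and is not AtScale's actual algorithm, which the conversation does not describe. The `AdaptiveCache` class, `slow_query` function and query names are all hypothetical.)

```python
from collections import Counter


class AdaptiveCache:
    """Keep the most frequently requested keys in memory.

    `load` is the slow path (for example, a query against the cluster).
    """

    def __init__(self, load, capacity=2):
        self.load = load
        self.capacity = capacity
        self.hits = Counter()  # how often each key has been requested
        self.memory = {}       # the hot subset held in memory

    def get(self, key):
        self.hits[key] += 1
        if key in self.memory:
            return self.memory[key]
        value = self.load(key)
        self._maybe_cache(key, value)
        return value

    def _maybe_cache(self, key, value):
        if len(self.memory) < self.capacity:
            self.memory[key] = value
            return
        coldest = min(self.memory, key=lambda k: self.hits[k])
        # Evict only if the new key is now requested more often than the coldest one.
        if self.hits[key] > self.hits[coldest]:
            del self.memory[coldest]
            self.memory[key] = value


slow_calls = []


def slow_query(name):
    """Stand-in for an expensive query; records each time it is invoked."""
    slow_calls.append(name)
    return f"result for {name}"


cache = AdaptiveCache(slow_query, capacity=2)
for name in ["sales", "sales", "returns", "inventory", "inventory", "inventory"]:
    cache.get(name)
print(sorted(cache.memory))
print(slow_calls)
```

In this run the rarely used "returns" entry is pushed out once "inventory" has been requested more often, so repeated dashboard queries for the popular data stop hitting the slow path.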
Kirill: Nice. That sounds like pretty solid foundation for Tableau.
Wrapping up this podcast, I’ve got another question for you,
like a visionary type of question or a philosophical one.
Obviously you’re deep into the space of Hadoop and data
and data science, machine learning. From where you’re
sitting and what you’re seeing, where do you think this
whole field of data science is going? And what should our
listeners look into to prepare for the future?
Scott: Let me narrow that down just a bit and pontificate on the
future of Hadoop, if you don’t mind.
Kirill: Sounds good.
Scott: I’m sure you’re familiar with the Gartner hype cycle.
Kirill: Yeah.
Scott: You know, we’ve gone through that peak where everybody
was just all about Hadoop, you know, “We’ve got to have this
and we’ve got to put it to use.” Now we’re kind of in that
trough of disillusionment where it seems like every time I go
online, I’m reading an article somewhere about the death of
Hadoop. I don’t think Hadoop is dead at all. I mean, I think
that whole idea is actually kind of silly. But, yeah, we’re
definitely in the trough of disillusionment, and then
afterwards comes the plateau of productivity, when
everybody has calmed down and realized that it’s not the
cure for everything, but at the same time, the sky is not
falling either. And you know, we settle in and we get some
work done.
I think that’s where the future of Hadoop is, is that
companies are going to start realizing what Hadoop is
actually useful for. There are going to be enough people with
the skills to do things in Hadoop and we’re going to actually
start getting things—it’s not just going to be the Fortune
500s who are able to extract value out of Hadoop. Pretty
much anybody will be able to.
Kirill: Gotcha. That’s some good advice. People listening to this,
don’t get afraid that Hadoop is dying. It’s not dying.
Everything will be okay. Just plough along and it’s going to
be an exciting space to be in. In any case, all these skills
that you learn, they’re so transferrable to whatever. Even if
something does replace Hadoop, it’s not going to be that
much different. You will be able to transfer all these skills
anyway.
Scott: And that’s another thing, too. It seems like every time I turn
around, somebody is comparing Hadoop to Spark, which I
think is kind of funny because every time I’ve ever seen
Spark, it was running on top of Hadoop. While it is possible
to set up a cluster that runs just Spark, the reality is that
hardly anybody does it that way.
Kirill: Okay. Thanks for the insights. Thank you very much, Scott,
for coming on the show. It was very exciting, sharing all that
knowledge. From the surveys that we run and from what I know
about our audience, we actually have quite a few executives
listening to this podcast and quite a few managers and
business owners even. If any of them ever want to get in
touch with you to invite you to do a public speech maybe at
their organization or help install Hadoop or do some
consulting work, where is the best place they can find you?
Scott: Probably e-mail, I would say. It’s [email protected].
Kirill: Gotcha. All right, we will definitely include that in the show
notes as well. People listening out there, if you need to
contact Scott, you have his e-mail now. And one last
question I have for you is, what is one favourite book that
you can recommend to our listeners to help them become
better at what they do?
Scott: I'm going to up that and say two.
Kirill: Sounds good.
Scott: You know, probably 80% of any job that you do with data is
getting the data into the usable format, something that you
can work with. O'Reilly has a great book for that called “Data
Wrangling with Python.” That book probably has 80% or
90% of the stuff that you’re going to do on a day-to-day basis
with data. It’s all about cleaning data, scraping data off the
web. I mean, if there is some technique or method for
preparing data, it’s probably in that book.
And the second one is Hadoop-specific and that’s “Hadoop
MapReduce v2 Cookbook” which is a really unnecessarily
long title for a book. That one is by Packt Publishing. I was
fortunate enough to be the technical reviewer on that book.
Again, it’s just full of day-to-day stuff that you would do with
Hadoop from spinning up a cluster in the cloud to bringing
in data and setting up a table with Hive. And it even goes
into using Mahout, which is another component of Hadoop,
to do machine learning. So if there is something that you
want to do with Hadoop, it’s probably in there.
Kirill: Sounds good. Thanks a lot. So there we go: “Data Wrangling
with Python” and “Hadoop MapReduce v2 Cookbook”. Once
again, thank you very much, Scott. I really appreciate you
taking the time to come on this podcast and share all of this
wisdom and knowledge with us.
Scott: Well, I really appreciate you inviting me to.
Kirill: So there you have it. I hope you enjoyed today’s podcast.
Lots and lots of valuable information. As you could see,
Scott has very in-depth knowledge on the topic and is
definitely in the most advanced frontiers of Hadoop and
everything to do with data in organizations, data lakes, and
big data and all of these things. He was kind enough to come
on the show, spend an hour of his time and share all of
these things with us.
So I hope you took this opportunity to pick up some
additional knowledge and skills. Personally, I definitely
learned a lot as well. My favourite part, of course, was the
breakdown of all of these different aspects of Hadoop, all of
these elements such as Hive, Kudu, Spark, Seahorse. Some
of them I knew about, some of them I’ve worked with, but
some of them I haven’t even heard of and I was very excited
to learn. I can definitely look back and say now I know some
more things about Hadoop, and it looks like this space is
constantly evolving. On that note, if you would ever like to
get in touch with Scott, if, for example, you are an executive
or a manager or you own your own business and you’d like
to get Scott to come in and help you with your analytics
strategy, make sure to reach out to him. As you could tell
from this podcast, this is a person with a huge wealth of
knowledge who’s excited to share it, so this is your go-to guy
for anything to do with Hadoop and data strategy.
And with that, don’t forget that you can find all the
resources mentioned in this episode at
www.superdatascience.com/51. There you will be able to
find the transcript for this episode as well as a link to Scott’s
LinkedIn, his company’s page, and his e-mail. Thank you so
much for being here today. I hope you enjoyed the episode. I
can’t wait to see you next time. Until then, happy analyzing.