algorithms by the dozen - monest.net
counting sort
1 sorting things out
Better searching through sorting. Computers would be of no
practical use if we couldn’t retrieve all that data we have put into them. Finding and
retrieving information, just as in a non-digital world, can be done far more efficiently
when things are organized. Imagine looking up someone in a phone directory ordered
by phone number instead of by name.
Sorting them by back number
was a great idea. I never
could tell them apart from
behind.
Algorithms by the Dozen
(c) Dirk Vandycke - 2012
sorting things out
2 Audition Chapter
That’s true.
But since good wine tends to improve with
age, let’s sort them bottles by how old they
are.
Yeah … great idea …
… we could use the quicksort
algorithm here … it being
the fastest anyway
What do you mean?
We could do that … however …
… there might be a slightly easier and faster way to
do this. Imagine reframing the problem a bit,
looking at it as a counting problem instead of a
sorting problem.
We could count the bottles for every age we
encounter.
That way we are sorting them as well. Let’s find out
how …
… meanwhile at Château
d’Inhéritance …
Now, counting sort, simply called count sort by its friends, is the algorithm we
want to have a look at here. Although it is a fully fledged, fast and easy to
implement way of ordering our data, its primary focus is actually on counting,
not on sorting. Hence its name. Let’s dive into the nitty gritty of counting sort.
Here’s what the steps of the algorithm look like:
1. First we have to list every age we might encounter, in ascending
order.
2. Next, we’ll go over every wine bottle, asking it for its age.
3. While doing that, we’re going to keep a tally of each age we encounter.
We’ll have to make sure we can keep records of this when we
implement the algorithm in a programming language. But for now we
can just count in fives, on a piece of paper.
4. When done, all that’s left is going over the list of ages, writing down
every age that was counted at least once. Et voilà, we have a sorted list
of all ages.
Giving us 0, 1, 2, 4, … (no 3!). That’s pretty
straightforward, isn’t it?
(oh, zero years old meaning less than one year)
O-kay. And exactly how does all
this relate to sorting, again? As I
see it, we are just counting stuff!
Well, you’re right, we’re just not there yet.
We now have to switch gears and implement the
whole algorithm. For now it may seem as if we’re
just holding the number of wine bottles for each
individual age, which has nothing to do with
actually sorting our bottles according to age. But …
… our data IS sorted already!
Take a close look at our counting table. The ages
we wanted to sort are written down in ascending
order. So our wine ages are always sorted as a
by-product of counting.
Frequency table:
k (age)            0   1   2   3   4
f(k) (frequency)   5   8   1   0   3
A frequency table or histogram holds the
number of occurrences for each possible
value in your data.
A frequency table is like an array of counters, one for every value we might encounter
in the data. Such tables are frequently used in statistics and even in artificial
intelligence. We certainly will see more of them in this book.
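To make the "array of counters" idea concrete, here is a minimal sketch in Java. It assumes the frequency table from this page is already filled in; the class and method names (FrequencyTableDemo, valuesCounted) are made up for this illustration and are not part of the chapter's code. It simply lists every age that was counted at least once.

```java
import java.util.ArrayList;
import java.util.List;

class FrequencyTableDemo {
    // Returns every value k whose counter f(k) is at least 1, in ascending order.
    static List<Integer> valuesCounted(int[] frequencies) {
        List<Integer> values = new ArrayList<>();
        for (int k = 0; k < frequencies.length; k++) {
            if (frequencies[k] > 0) values.add(k); // skip ages nobody counted
        }
        return values;
    }

    public static void main(String[] args) {
        int[] frequencies = {5, 8, 1, 0, 3}; // f(k) for k = 0..4, taken from the table
        System.out.println(valuesCounted(frequencies)); // prints [0, 1, 2, 4]
    }
}
```

Because the indexes of the array are visited from low to high, the ages come out sorted without a single comparison between two ages.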
8 bottles of 1 year old
wine were counted here
Frequency tables very often get visually represented as bar charts
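[Figure: the wineAges array (indexes 0 to n) and the frequency table (indexes 0 to 99); the counter at index 4 holds 1, all other counters hold 0.]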
Here we assume that our original array just holds the ages of all wine bottles.
How would this code change if we instead had an array of WineBottle objects to
begin with?
4
Now that we know the counting sort algorithm, how about implementing it in a
real programming language? Let’s go over the steps of the algorithm once more
and see if we can explain this to Mr. JVM.
1 2 4 7 4
The first bottle, being
4 years old, got
counted here
We had four steps in our algorithm. Let’s go through them again, trying to write the
corresponding code. Let’s start from an array of ages we will call wineAges.
1. First we had to list every age we might encounter, in ascending order. This
translates into defining a frequency table and initializing all counters to zero.
(Many modern languages, like Java, initialize all cells to zero automatically.)
int[] wineAgeFrequencies = new int[100];
for (int i = 0; i < wineAgeFrequencies.length; i++) wineAgeFrequencies[i] = 0;
2. Next, we were supposed to go over every wine bottle asking it for its age. That
sounds like a job for a for-loop.
for (int i = 0; i < wineAges.length; i++) wineAgeFrequencies[wineAges[i]]++;
3. And while doing that, we were going to keep a tally of each age we encounter.
Meaning we’ll have to increase the frequency for that age.
4. When done, all that’s left is going over the list of ages, writing down every age as
many times as it was counted. We finally end up with a sorted list of all ages.
int k = 0;
for (int i = 0; i < wineAgeFrequencies.length; i++)
    for (int j = 0; j < wineAgeFrequencies[i]; j++) wineAges[k++] = i;
What’s with this k variable???
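Put together, the four steps could look like the sketch below. It keeps the chapter's variable names, but the wrapper class CountingSortDemo, the maxAge parameter and the sample ages are assumptions made for this illustration only.

```java
import java.util.Arrays;

class CountingSortDemo {
    // Sorts wineAges in place. Every age must lie in the range 0..maxAge.
    static void countingSort(int[] wineAges, int maxAge) {
        // Step 1: one counter per possible age (Java zero-fills new int arrays).
        int[] wineAgeFrequencies = new int[maxAge + 1];
        // Steps 2 and 3: visit every bottle and tally its age.
        for (int i = 0; i < wineAges.length; i++) wineAgeFrequencies[wineAges[i]]++;
        // Step 4: write every age back as many times as it was counted.
        int k = 0; // next free position in wineAges
        for (int i = 0; i < wineAgeFrequencies.length; i++)
            for (int j = 0; j < wineAgeFrequencies[i]; j++) wineAges[k++] = i;
    }

    public static void main(String[] args) {
        int[] wineAges = {4, 7, 4, 2, 1, 0, 1}; // made-up sample data
        countingSort(wineAges, 99);
        System.out.println(Arrays.toString(wineAges)); // prints [0, 1, 1, 2, 4, 4, 7]
    }
}
```

An age outside 0..maxAge would throw an ArrayIndexOutOfBoundsException, which is exactly the "values are used as indexes" constraint of this algorithm.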
There are two bits of code in our implementation that we’ll have to look at
carefully to fully grasp what they are doing. Let’s look at that code, up close.
First there’s the line counting each age’s frequency:
wineAgeFrequencies[wineAges[i]]++;
Which we, for clarity, could restate as:
int wineAge = wineAges[i]; wineAgeFrequencies[wineAge]++;
The second statement adds one to the count of that particular age.
Secondly, we have the way the array is finally rewritten in ascending order:
int k = 0;
for (int i = 0; i < wineAgeFrequencies.length; i++)
    for (int j = 0; j < wineAgeFrequencies[i]; j++) wineAges[k++] = i;
For every value counted (i.e. for every different wine age) …
… as many times as we counted it (as given by its frequency) …
… write the value (age) down in the original array. We keep track of
where we are writing in the array with a good old-fashioned counter, k.
We increment the counter every time after it’s been used.
Q: Why do we need yet another sorting
algorithm? Aren’t we complicating
things?
A: First of all, because it’s one of the fastest
ways of sorting. But it’s also a very insightful
example of how reframing a problem can
lead to a more efficient solution.
As a nice bonus, this sorting algorithm can
easily remove duplicate values.
Last but not least, it can be implemented as a
‘stable’ sorting algorithm, which the usual
quicksort is not. Stable sorting will come in
handy when we look at radix sort.
Q: How do we know it’s one of the
fastest ways of sorting?
A: Because, unlike quicksort, insertion sort and
merge sort, it’s not a comparison sort. Not a
single comparison between elements
occurs. It handles sorting by counting.
Hence its name: counting sort.
When sorting, you have to at least look at
your data once. Counting sort just runs over
all data, looking only once at each element.
This is also called sorting in linear time. The
time needed to sort all values is linear in the
number of values.
Q: Then why bother using the other
sorting algorithms?
A: Counting sort uses its values as indexes. So
the values sorted have to be non-negative
integers (a technical constraint on array
indexes in a lot of programming languages).
There’s a workaround: scaling and shifting
the range of values to a range of
non-negative integers.
However, counting sort’s performance
deteriorates quickly from linear time and is
caught up by quicksort as soon as the
number of possible values (and with it the
length of the frequency table) starts to
outweigh the total number of values to be
sorted (i.e. the length of the original data array).
Finally, counting sort isn’t an in-place
sorting algorithm.
Q: Aren’t we using more memory this
way?
A: Of course. We need a frequency table. But
the length of this array only depends on the
range of values. It’s indifferent as to the
total number of values. As is very often the
case with algorithms, we can, to a certain
extent, exchange speed for memory. So
here we have a faster algorithm. More
memory is the price paid for this.
It also illustrates the relation between data
structures and algorithms. Better data
structures (in the form of an additional
frequency table, here) lead to less complex
(and often even faster) algorithms.
Wait just a minute. We are
actually sorting wine ages, not
wine bottles!
Well, you sure got us there.
By sorting the wine ages we could reuse the same
array, making it an in-place sorting algorithm.
But as soon as we start using objects, this solution
simply isn’t going to cut it anymore.
We don’t like to admit it, but sorting objects with
counting sort is even tougher than it seems. But
we can still pull this off!
Let’s go over our code dealing with this new array of WineBottle objects and see what
needs to change. Since step 1 isn’t affected, we’ll take off at step 2 (and step 3).
for (int i = 0; i < wineCellar.length; i++) wineAgeFrequencies[wineCellar[i].getAge()]++;
So we’re off the hook, right? Unfortunately it’s not that easy. Look at step 4 again.
int k = 0; for (int i = 0; i < wineAgeFrequencies.length; i++)
for (int j = 0; j < wineAgeFrequencies[i]; j++) wineAges[k++] = i;
The first problem here is wineAges. We should change that to wineCellar.
However that’s when our biggest problem pops up and destroys everything we worked
for. We can’t put ages where bottles are expected. What’s more, just by knowing the
order of the ages how can we figure out where bottles will belong in their sorted
order?
Compiler error when writing an age into wineCellar:
incompatible types
required: WineBottle
found: int
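The chapter leaves this question open. One common way out, shown here only as a hedged sketch and not necessarily the solution the book goes on to present, is to turn the frequency table into a table of starting positions (a running total of the counts) and then copy every bottle into a second array. The WineBottle class below is a minimal stand-in, and the names BottleCountingSort, sortByAge and maxAge are assumptions made for this example.

```java
class BottleCountingSort {
    // Hypothetical minimal WineBottle, just enough for this sketch.
    static class WineBottle {
        private final String label;
        private final int age;
        WineBottle(String label, int age) { this.label = label; this.age = age; }
        int getAge() { return age; }
        String getLabel() { return label; }
    }

    // Returns a new array with the bottles ordered by age (ages in 0..maxAge).
    // Bottles of equal age keep their original relative order (stable).
    static WineBottle[] sortByAge(WineBottle[] wineCellar, int maxAge) {
        int[] positions = new int[maxAge + 2];
        // Count: positions[age + 1] holds the number of bottles of that age.
        for (WineBottle bottle : wineCellar) positions[bottle.getAge() + 1]++;
        // Running total: positions[age] becomes the index of the first bottle of that age.
        for (int age = 0; age <= maxAge; age++) positions[age + 1] += positions[age];
        // Copy each bottle to the next free slot reserved for its age.
        WineBottle[] sorted = new WineBottle[wineCellar.length];
        for (WineBottle bottle : wineCellar) sorted[positions[bottle.getAge()]++] = bottle;
        return sorted;
    }

    public static void main(String[] args) {
        WineBottle[] wineCellar = {
            new WineBottle("A", 4), new WineBottle("B", 1),
            new WineBottle("C", 4), new WineBottle("D", 0)
        };
        StringBuilder labels = new StringBuilder();
        for (WineBottle bottle : sortByAge(wineCellar, 99)) labels.append(bottle.getLabel());
        System.out.println(labels); // prints DBAC
    }
}
```

Note the price: the sorted bottles end up in a second array, so this variant is clearly not in-place.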
will counting sort
ever be able to keep
up with its
promises?
watch for the upcoming new
title in the Head First series
HF Algorithms &
Data Structures
2 capturing taste
Recommendation is everywhere. Recommendations and ads are
issued billions of times a day on the net. Recommending movies, books, e-commerce
products and services … and even people. Want to know how Google, Amazon and
dating sites are making highly accurate suggestions to their customers? Looking to
write your own recommender system? Is a customer going to buy your new product?
Diving into this chapter will get you those answers.
Oh my. What a perfect
match we are. Thanks to the
collaborative filtering on the
dating site we were using.
Coming highly recommended
The union of food critics told their
members to lay aside work because of
the growing number of places to cover.
But hiring more of them to reduce
their load, really would be too
expensive.
Over the years, the “Epicure Restaurant Guide” has made quite a name for itself as
the leading source for gastronomic advice with the most complete coverage
available on places to eat, worldwide. Their critics visit thousands of restaurants
daily, reporting their experiences through the famous guide.
To cope with the fact that the guide is printed only once a year, the Epicure
company has recently established a web presence to put up-to-date review reports
online. They also enabled users to upload their personal reviews and ratings. The
sky seemed to be the limit.
But there’s a cloud in this bright sky …
I was told that all those ratings we’re
gathering from our users could be turned
into useful recommendations. But I’m a bit
skeptical about the whole idea. Perhaps you
can help us out?
and health concerns!
            Dorsia   Gusteau's   Planet Arraywood   Half Moon Utility Restaurant
Daisy         1          3              3                        0
Donald        1          4              -                        1
Patrick       3          1              2                        1
Jean          -          2              -                        4
Sylvester     3          -              4                        3
We’ve already taken a look at the website for
you. It seems visitors can rate restaurants
on a scale from 0 to 4 stars.
They also can enter data of new restaurants
that aren’t reviewed by Epicure’s experts yet.
We already went ahead of ourselves and
asked for a sample data set of ratings from
Epicure. And they just got back to us.
Think about this before you turn the page.
Which restaurant would you recommend to
someone else from this sample? Why?
            Dorsia   Gusteau's   Planet Arraywood   Half Moon Utility Restaurant
Daisy         1          3              3                        0
Donald        1          4              -                        1
Patrick       3          1              2                        1
Jean          -          2              -                        4
Sylvester     3          -              4                        3
Seems pretty straightforward to me. We should
figure out each restaurant’s average rating. Then
we’d be able to recommend the restaurant with the
highest average score first. Like this ...
Average rating: Dorsia 2, Gusteau’s 2.5, Planet Arraywood 3, Half Moon Utility Restaurant 1.8
So ‘Planet Arraywood’ definitely would be the
best recommendation to go with in this case,
with Gusteau’s being runner-up.
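For readers who want to check those averages, here is a small sketch in Java. The ratings are copied from the sample data set; the class name AverageRatings and the choice of -1 as a marker for a missing rating ("-") are assumptions made for this illustration.

```java
import java.util.Arrays;

class AverageRatings {
    static final int NONE = -1; // assumed marker for a missing rating ("-")

    // Returns the simple average per restaurant (column), ignoring missing ratings.
    static double[] averages(int[][] ratings) {
        int restaurants = ratings[0].length;
        double[] result = new double[restaurants];
        for (int r = 0; r < restaurants; r++) {
            int sum = 0, count = 0;
            for (int[] person : ratings) {
                if (person[r] != NONE) { sum += person[r]; count++; }
            }
            result[r] = count == 0 ? Double.NaN : (double) sum / count;
        }
        return result;
    }

    public static void main(String[] args) {
        // Columns: Dorsia, Gusteau's, Planet Arraywood, Half Moon Utility Restaurant
        int[][] ratings = {
            {1, 3, 3, 0},          // Daisy
            {1, 4, NONE, 1},       // Donald
            {3, 1, 2, 1},          // Patrick
            {NONE, 2, NONE, 4},    // Jean
            {3, NONE, 4, 3}        // Sylvester
        };
        System.out.println(Arrays.toString(averages(ratings))); // prints [2.0, 2.5, 3.0, 1.8]
    }
}
```

Leaving a missing rating out of both the sum and the count is not the same as counting it as a zero: a zero would drag the average down.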
Sad but true
Average ratings only account for the ratings, not for
where they came from. You may have noticed that we
didn’t use all of the data we were given.
So recommendations will end up being exactly the
same for everyone. They will never be personalized.
Which is too bad, because restaurants are all about
taste. And taste is very personal.
Wait just a minute there! That way, every
visitor gets to see the same recommendation?
I visited ‘Planet Arraywood’ and it’s not uhm …
exactly my ‘taste’.
Could you come up with some disadvantages of using the
average rating of restaurants to make recommendations?
Everyone gets the same recommendation
We need to include taste
When we average ratings, we are, in fact,
averaging people’s opinions. Someone giving
a high score to a restaurant, definitely must
like that restaurant more than someone giving
it a lower rating.
So getting an average rating of people who
have visited the restaurant, like we did, still
seems like a good start. The problem with a
simple average is that everyone’s taste is
being put on the same footing.
But people can get better recommendations
from people with a similar taste to theirs. So
we are going to put a higher weight on ratings
from people having a closer taste. What we
need is a weighted average.
Could you come up with some disadvantages of using the
average rating of restaurants to make recommendations?
Everyone gets the same recommendation
a new restaurant might not have ratings yet, even if it’s good
good on average doesn’t have to mean good for everyone
an average rating doesn’t take into account who gave the ratings
ratings are subject to taste, while an average is purely objective
every rating just has the same ‘impact’ on the average
the best restaurant on average may get very crowded
So, averages won’t cut
it. I would love to see
you do better.
Head First: Welcome, Mr. Average. How are you?
Weighted Average: Fine, thank you. But I do prefer
you call me Weighted Average, if you don’t mind. I
really do come from a large family. And my very
popular cousin Simple Average very often seems to get
all the credit for being the (only) average.
Head First: Of course. Can you explain to our readers
why you’re so different?
Weighted Average: Sure. My cousin just adds all
values and divides that sum by the number of values.
Head First: And that’s not what you do?
Weighted Average: Well he gives every value the
same weight, while I’m allowing for different weights.
The values with a larger weight will have a greater
impact on the average, pulling it more towards them. I
take a piece of every value, but not necessarily an equal
piece.
Head First: You mean that some values are more
important to you than others?
Weighted Average: Exactly.
Head First: Why is that?
Weighted Average: Well, that depends on the
context. My cousin only cares about the values.
Head First: I see. So what about the context that our
readers are involved with here, rating restaurants?
Weighted Average: In that case, ratings from people
sharing your ‘opinion’ about restaurants you visited
might be more interesting to you.
Head First: Why wouldn’t a simple average do?
Weighted Average: Oh come on. A simple average
would be the same for everyone. That’s not using
context, only content (ed. the ratings). Besides, very
extreme but rare ratings could mess up the whole
simple average. And my 4 doesn’t have to be your 4. We
can have different biases (ed. calibrations).
Head First: What do you mean?
Weighted Average: It’s because of your ‘taste
similarity’ to some people that their ratings may be
more important to you than those of others with less
similar taste to yours. Why go for a generic average
rating when you could go for a rating tailored to your
particular preferences.
Head First: So if people rate restaurants the same way
I do, their ratings for restaurants I didn’t visit would be
more useful to me?
Weighted Average: Right on.
Head First: Really?
Weighted Average: I’m not pulling this out of thin air
here. I got a lot more popular with the emergence of
social media, you know. Google, Amazon, Ebay … all use
my services. Even offline retailers do so when they
make personalized coupons. Why send shaving cream
coupons to a five year old, right? You’d better throw in
a computer game coupon instead.
Head First: That makes sense. I guess we’ll be seeing
more of you. Thank you very much for the interview.
Weighted Average: Don’t mention it. If you stay
around, you can see me on stage in a few pages.
Before we dive into the specifics of a weighted average, let’s go over the big picture here
and set up a road map for what we’re about to do.
1. … to recommend restaurants to someone, we will …
2. … try to estimate his/her missing ratings … (for restaurants (s)he didn’t visit)
3. … by averaging all known ratings … (for that restaurant)
4. … but weighing them by taste similarity, based on the difference in ratings between
recommendee and rater … (for the restaurants they both visited)
So, for instance, the 3 Daisy gave Gusteau’s would get a weight based on the similarity
between her and Sylvester. This ‘similarity’ will be derived from the ratings of all
restaurants they both did rate.
To keep focus, let’s assume we want to estimate Sylvester’s rating for Gusteau’s. That way
we will be able to present Sylvester with an estimated rating for Gusteau’s. If we needed to
estimate more than one rating for one person, as would be the case for Jean, we could order
the restaurants he didn’t visit by those estimates. But let’s do this for Sylvester first. He
didn’t rate Gusteau’s. So what would Sylvester probably rate Gusteau’s if he went there?
3. We’ll have to average all ratings for Gusteau’s: 3, 4, 1 and 2.
4. Daisy gave Gusteau’s a 3. So we’ll weigh her rating according to the
similarity between her ratings for other restaurants and Sylvester’s
ratings for those other restaurants.
We’ll do the same for the 4 from Donald. Its weight will depend on
the similarity of the ratings given by Donald and Sylvester to all
restaurants they both rated (there are 2: Dorsia and the Half Moon).
And the ratings for all restaurants from both Patrick and Sylvester
will decide on the weight Patrick’s 1 will get.
Finally, Jean’s 2 will be weighted based on the ratings of all
restaurants both Jean and Sylvester visited (there’s only one: the
Half Moon restaurant).
A weighted average adds all values to be averaged, after multiplying them first with
some weight. Finally this sum is divided by the sum of all weights used.
Quite a definition. Let’s go through it and see how to calculate a weighted average for
the ratings of Gusteau’s in our sample set.
(? × 3 + ? × 4 + ? × 1 + ? × 2) / (? + ? + ? + ?)
We don’t have weights, do we?
Where will we get them?
Be patient. We’ll get to this next.
Since we have ratings, we have taste. By comparing
people’s tastes we’ll be able to figure out how close their
tastes really are. The closer people are in taste to the
person getting a recommendation, the higher we’ll
weigh their ratings.
The higher a value’s weight, the more the
average will lean towards that value, giving it
more ‘importance’. Take, for example, the
simple average of 5 and 7. It’s 6, which is
situated exactly in the middle between 5 and 7.
Suppose we want to give 7 triple the weight of
5, pulling the weighted average towards 7,
giving 7 a greater influence on the average.
Three times as much as 5, to be exact.
The new average, 6.5, is now 3 times as far from
5 as it is from 7.
(5 + 7) / 2 = 6
(1 × 5 + 3 × 7) / (1 + 3) = 6.5
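The same arithmetic as a minimal Java sketch. The class and method names (WeightedAverageDemo, weightedAverage) are made up for this illustration; the method just follows the definition: multiply each value by its weight, add everything up, and divide by the sum of the weights.

```java
class WeightedAverageDemo {
    // Sum of (weight × value), divided by the sum of all weights used.
    static double weightedAverage(double[] values, double[] weights) {
        double weightedSum = 0;
        double weightSum = 0;
        for (int i = 0; i < values.length; i++) {
            weightedSum += weights[i] * values[i];
            weightSum += weights[i];
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        double[] values = {5, 7};
        // Equal weights give the simple average.
        System.out.println(weightedAverage(values, new double[]{1, 1})); // prints 6.0
        // Giving 7 triple the weight of 5 pulls the average towards 7.
        System.out.println(weightedAverage(values, new double[]{1, 3})); // prints 6.5
    }
}
```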
Suppose 6 people gave one restaurant a simple
average score of 2, while 4 people gave that same
restaurant only a 1 and 5 people gave it a 4. Could
you figure out the total average score by using a
weighted average?
HINT: you should obviously make a weighted average of the scores 2, 1 and 4,
but you’ve got to figure out the weights.
Q: Where did the weights in the 5 and
7 example come from?
A: The exact weights don’t really matter, as
long as the weight of 7 is three times that
of 5. Using 3 and 9 or 15 and 45 as weights
will get you the exact same average.
(3 × 5 + 9 × 7) / (3 + 9) = 6.5
(15 × 5 + 45 × 7) / (15 + 45) = 6.5
Brain Barbell: In fact, as long as you triple
the weight w of 5 to go with 7, the
equation always simplifies back to the
weights 1 and 3.
(1w × 5 + 3w × 7) / (1w + 3w) = 6.5
Suppose 6 people gave one restaurant a simple
average score of 2, while 4 people gave that same
restaurant only a 1 and 5 people gave it a 4. Could
you figure out the total average score by using a
weighted average?
(6 × 2 + 4 × 1 + 5 × 4) / (6 + 4 + 5) = 2.4
Q: Could you please tell me again why we can’t
just list all restaurants with their average
ratings by experts and users?
A: Sure. The whole point is to get customized ratings.
Otherwise, averages would be the same to anyone,
regardless of their personal taste. You want a higher
weight put on ratings of people whose taste lines up
with yours. That’s why we’ll use weighted averages.
Q: But why do recommendations have to be
personalized? Isn’t a good restaurant just a
good restaurant, period.
A: Imagine you love Italian food while getting ratings
from someone who hates Italian food. Do you think it
would be useful advice? What’s more, a fixed number
of experts can’t give as diverse an opinion as
hundreds of users.
Q: Why not?
A: A large community of users can cover far more
restaurants than any fixed number of experts can.
Also, there will be more ratings on each individual
restaurant. Besides, expert ratings are only personal to
them. The whole idea is called collaborative filtering.
It is heavily used by Internet moguls like Google,
Amazon, YouTube, Flickr, Facebook, LinkedIn … and is
part of their success. Those sites act on context as
well as content when being searched. And the user is
just part of that context.
Q: I’m sorry. colla-what?
A: Collective Intelligence is about leveraging the
imperfect knowledge of lots of people to get smart
decisions. Collaborative filtering is a subdomain
focused on recommending systems and smart filtering
of data.
Q: I see. So it’s all mainly about personalized
recommendations. But any expert could give
personalized advice, no?
A: We wouldn’t be so sure of that. There’s a lot of
experimental data pointing in the direction of
collective intelligence giving better solutions than any
single expert.
Q: How can that be?
A: It’s a bit outside the scope of this book, but let’s
leave it at this: if lots of people err on both sides (for
better and for worse), then their errors will tend to
cancel out.
Q: Woah, a weighted average is totally
different from a regular average, right?
A: A simple average is actually a special case of
weighted average in which all weights are equal. But
since they’re not, in the case of a weighted average,
ratings are customized by the weights.
Q: On what will the weights for our ratings be
based? There’s no such thing as weights
available in the data we got.
A: People having similar taste to the user looking for a
recommendation will see their ratings given more
weight in the average. So we will calculate the weights
based on the taste similarity of people. We’ll start tackling
this problem on the very next page.
What we do know:
1. In order to give someone restaurant recommendations, we will try to estimate
the ratings she would give restaurants she didn’t visit (yet). In our sample
dataset, that is the case for Sylvester. What rating would he give
Gusteau’s?
2. To estimate Sylvester’s rating for Gusteau’s, we will calculate a weighted average of
all of Gusteau’s ratings.
What we don’t know:
How much weight to put on each rating.
What we are going to do:
Put more weight on ratings from people who are closer to
Sylvester’s taste, while putting less weight on ratings from people
whose taste is less similar to Sylvester’s.
Well that kinda narrows it down, doesn’t it?
Let’s just get our universal taste formula
out and let’s get going, right?
Well, almost …
People can get better recommendations from
someone with similar taste to theirs.
So what we are going to do is try to figure out how
much the taste of different people is alike, instead
of trying to define one person’s taste.
Once we have a taste “distance” between people,
we can take all those distances, together with their
ratings, into account to distill one average
rating.
We have ratings already
All we have to do is compare people’s ratings of
restaurants to discover how close their preferences
actually are.
[Figure: two rating axes, each running from 0 to 4, one per restaurant]
To turn individual ratings into something as useful as taste similarity between people
giving ratings and receiving recommendations, we need to figure out where each taste is
situated amongst all others. If we just could put ratings on an axis or a scale, we could use
regular “distance” as our measure. This is in fact perfectly possible.
Restaurant rating axis and people points
Imagine we put the ratings for any two restaurants along two different axes. Each
point in such a system then represents a pair of ratings, one for each restaurant.
If both ratings of a pair belong to one person, the point represents that person.
Could you put all the other
people from the sample
dataset in this picture?
or more, as we will soon see, but let’s go slow for now!
[Figure: the same two rating axes (0 to 4), with people plotted as points]
Now that we have points representing people’s tastes, the distance between those
points becomes their taste distance. That’s exactly what we were looking for.
We added two more people to the system to make things clear.
Daisy & Donald have
the same rating/taste
regarding Dorsia, a
meager 1
Daisy
Any idea on how to calculate the
distance between Patrick and
Donald?
I get the part where we’ll give
ratings from people ‘more similar’
to the recommendee a higher
weight. But could you back up a
bit as to why we need distance?
Well, to be honest, any measurement trying to account for how close
people’s ratings are would do. And in fact, very often linear regression
is used to accomplish this instead of simple Euclidean distance.
But this would really take us too far from the overall picture right now.
And since Euclidean distance is a very good and, foremost, an easy way
to measure ‘proximity’ or ‘closeness’, we’ll use it to explain the
mechanics here.
Once you get the big picture with something as easy as a simple
distance formula, you can just plug in any measurement of
people’s ‘distance in taste’ you can think of.
But do not underestimate Euclidean distance. It can give great results
in practice, just as well!
(Euclidean) distance is the first
thing that comes to mind if we
want to measure how close
things are to each other.
c = √((3 − 1)² + (4 − 1)²)
You probably saw this in your math classes
with a formula, somewhere along this line:
c = 5
However easy it is reading the distance between Daisy and Donald from one axis, we do need
a general way to calculate the distance between two points. This is where Pythagoras’
theorem comes into play.
Pythagoras to the rescue
All we have to do is take the differences in
ratings from two people for each restaurant,
square them, add those squared differences
together, and finally take the square root of
this sum.
C = √(A² + B²)
Wait a minute. Why do
we only mention two
restaurants? There are
4 of them!
Well, we wanted to start off explaining the whole distance story as simply as possible. With just
two dimensions, we were able to actually visualize the taste space. With 3 or more dimensions
(one for each restaurant), we can no longer visualize this. The good news is we can still use
the same formula, no matter how many dimensions we end up having.
Let’s do this once more for the
distance between Patrick and Donald,
but this time with all dimensions.
1. For each restaurant, take the difference between the ratings: 1 − 3, 4 − 1, ?, 1 − 1
2. Square those differences: (1 − 3)², (4 − 1)², (?)², (1 − 1)²
3. Add the squared differences: (1 − 3)² + (4 − 1)² + (?)² + (1 − 1)²
4. Take the square root of this sum: √((1 − 3)² + (4 − 1)² + (?)² + (1 − 1)²)
And exactly where did all those dimensions
come from? We had a nice formula for distance
in a 2D plane. But now we’re suddenly talking 4
dimensions here?
Ok, take a deep breath, while we go over this slowly.
We started off representing each restaurant on its own axis, setting
out people’s ratings along them. So one person’s overall taste gets
represented by one point. That way, the distance between people
in that axis system becomes an indication of how close their taste
is.
We first took on only 2 restaurants to introduce the concept. After
all, we can easily visualize two dimensions on a book page. Heck,
using perspective we could even try to simulate three dimensions.
But in reality (as in our sample) people rate far more than two or
three restaurants. Of course it’s impossible to keep visualizing that.
But we don’t have to. Mathematics tells us we can still use the
same formula no matter how many dimensions we have.
I see. Well ... then perhaps you
could enlighten us with the general
formula?
Sure, let’s take a new page for this.
What mathematics tells us …
To calculate the distance between two
points in an n-dimensional space …
1. subtract all matching coordinates of both points
2. square each difference
3. add all squared differences into one sum
4. and finally take the square root of that sum

… how we use it in our context
To calculate the taste-distance between two people
who have both rated the same n restaurants …
1. for each restaurant, subtract the matching ratings given by both people
2. square those differences
3. add all squared differences into one sum
4. and finally take that sum's square root
To calculate the taste-distance between two people, we are going to use the general
formula for Euclidean distance. The coordinates we are using are the ratings for the
restaurants we have from those two people.
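Written out as a single formula, the general Euclidean distance could look like this. The symbols p, q, n and d are our own notation for this sketch: p and q are the two points (two people), n is the number of dimensions (restaurants), and p_i and q_i are the matching coordinates (ratings for restaurant i).

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```

For n = 2 this is just the familiar distance formula in a plane, so nothing new had to be invented to go from 2 restaurants to 4.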
Q: Wait a minute, for the distance calculation
on page 20, you first subtract Donald's rating
for Dorsia from Patrick's, but it gets done the
other way around for their ratings of
Gusteau's. What's going on?
A: Well spotted, that's a subtle one. It doesn't really matter,
as the difference gets squared anyway, so the result is never
negative, whichever way around we subtract.
Q: OK, but why do the other restaurants come into
play here? We originally were trying to estimate
Sylvester's rating for Gusteau's.
A: That is correct. However, the more restaurants two
people have both rated, the more accurately the distance
between them can be determined. We might have 99
restaurants we agree upon but 1 where we sharply
disagree. That would make our tastes very similar. But
not if we accidentally looked only at the difference
between our ratings for the one we disagreed upon.
Q: What happens if there's no rating from a
person for a certain restaurant?
A: Just don't use that dimension in the distance
calculation. Notice that this is not the same as
using zero for any missing ratings! Another
possibility would be to use the estimated ratings
we will eventually be able to calculate.
Q: This Euclidean distance seems quite
important for measuring differences in taste,
doesn’t it?
A: Well, it sure is a great way to introduce these
concepts. But it really isn't the only distance measure
we could have used. There are a lot more of them, all
with their own intricacies and advantages. There's
nearest neighbor, vector cosine, Pearson correlation,
Jaccard coefficient, Hamming distance, Manhattan
distance, and so on …
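To see why leaving out a missing rating is not the same as using zero for it, here is a minimal Python sketch. It uses Patrick's and Donald's ratings from our sample data. The function name taste_distance and the use of None for a missing rating are assumptions of this sketch, not something fixed by the chapter.

```python
import math

def taste_distance(a, b):
    """Euclidean distance over the restaurants both people rated.

    a and b are lists of ratings in the same restaurant order;
    None marks a missing rating (an assumption of this sketch).
    """
    squared = [(x - y) ** 2 for x, y in zip(a, b)
               if x is not None and y is not None]
    return math.sqrt(sum(squared))

patrick = [3, 1, 2, 1]
donald = [1, 4, None, 1]   # Donald never rated the third restaurant

# Leave the missing dimension out of the calculation.
skipped = taste_distance(patrick, donald)

# The tempting but wrong alternative: pretend the missing rating is a 0.
zero_filled = taste_distance(patrick, [0 if r is None else r for r in donald])

print(round(skipped, 2))      # 3.61
print(round(zero_filled, 2))  # 4.12
```

Filling in a zero invents a disagreement that was never there, so the two of them end up further apart than their actual ratings justify.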
Since we’ll need them as weights for calculating
Sylvester’s rating on Gusteau’s, could you calculate
the taste distances from Sylvester to everyone else?
Let's quickly go over what we've been up to until this point, one more time.
1. We want to make recommendations to Sylvester. This was translated into
   estimating a rating for restaurants he hasn't visited, which is only
   Gusteau's in his case.
2. To get such a rating for Gusteau's customized to Sylvester, we want to
   calculate a weighted average of all Gusteau's ratings.
3. The weight each person's rating gets depends on how similar this person's
   taste is to Sylvester's.
4. Taste distance follows from placing people, according to their ratings, in
   taste space, where each axis represents the ratings for a particular
   restaurant. Since we can only visually represent two dimensions
   (restaurants), we prefer using the common formula for calculating
   distances immediately, for example for Daisy and Sylvester:
   $\sqrt{(1 - 3)^2 + (3 - ?)^2 + (3 - 4)^2 + (0 - 3)^2}$
Since we’ll need them as weights for calculating
Sylvester’s rating on Gusteau’s, could you calculate
the taste distances from Sylvester to everyone else?
First of all, let's grab a copy of our sample dataset.
Now, let's take on the distance between Daisy and Sylvester for starters.
Let's forget about the term with the missing rating:
$\sqrt{(1 - 3)^2 + (3 - ?)^2 + (3 - 4)^2 + (0 - 3)^2} = 3.74$
Next, we'll do the distance between Donald and Sylvester:
$\sqrt{(1 - 3)^2 + (4 - ?)^2 + (? - 4)^2 + (1 - 3)^2} = 2.83$
Next, we have the distance between Patrick and Sylvester:
$\sqrt{(3 - 3)^2 + (1 - ?)^2 + (2 - 4)^2 + (1 - 3)^2} = 2.83$
Finally, there's the distance between Jean and Sylvester left:
$\sqrt{(? - 3)^2 + (2 - ?)^2 + (? - 4)^2 + (4 - 3)^2} = 1.00$
If we need taste distances as weights for our weighted average rating of Gusteau's for
Sylvester, we'd better start figuring them out at this point.
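If you would rather let the computer do the crunching, the four distances can be reproduced with a short Python sketch. The dictionary layout, the name taste_distance and the use of None for a missing rating are assumptions made for this illustration.

```python
import math

# Sample ratings in the order Dorsia, Gusteau's, Planet Arraywood,
# Half Moon Utility Restaurant; None marks a missing rating.
ratings = {
    "Daisy":     [1, 3, 3, 0],
    "Donald":    [1, 4, None, 1],
    "Patrick":   [3, 1, 2, 1],
    "Jean":      [None, 2, None, 4],
    "Sylvester": [3, None, 4, 3],
}

def taste_distance(a, b):
    # Only restaurants rated by both people take part in the sum.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)
                         if x is not None and y is not None))

distances = {name: taste_distance(row, ratings["Sylvester"])
             for name, row in ratings.items() if name != "Sylvester"}

for name, d in distances.items():
    print(f"{name}: {d:.2f}")
# Daisy: 3.74
# Donald: 2.83
# Patrick: 2.83
# Jean: 1.00
```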
I really hate to spoil the fun again. But the closer
people's tastes are, the shorter the distance
between them gets. I would love to see a BIGGER number
for people who are closer together in taste.
Sharp! But that’s not that big an issue. You
could simply invert the distance. Making
big what is small and vice versa.
But to avoid the risk of division by zero,
we’ll first add 1 to anything we invert.
So if Daisy's distance to Sylvester is 3.74,
we'll use 1/(1+3.74) as the weight for Daisy's
rating. That gives us a weight of 0.21 for
Daisy's rating.
Doing the same for everyone gives: Daisy 0.21, Donald 0.26, Patrick 0.26, Jean 0.50.
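Turning distances into weights is a one-liner. Here is a small Python sketch of it, starting from the taste distances to Sylvester calculated earlier. The function name weight is our own choice for this illustration.

```python
# Taste distances to Sylvester, as calculated earlier.
distances = {"Daisy": 3.74, "Donald": 2.83, "Patrick": 2.83, "Jean": 1.00}

def weight(distance):
    # Adding 1 keeps the division safe when the distance is 0
    # (identical taste), which then gets the maximum weight of 1.
    return 1 / (1 + distance)

weights = {name: weight(d) for name, d in distances.items()}

for name, w in weights.items():
    print(f"{name}: {w:.2f}")
# Daisy: 0.21
# Donald: 0.26
# Patrick: 0.26
# Jean: 0.50
```

Notice that a weight computed this way always lands somewhere between 0 and 1, and that Jean, who is closest to Sylvester, now gets the biggest number.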
            Dorsia   Gusteau's   Planet Arraywood   Half Moon Utility Restaurant
Daisy       1        3           3                  0
Donald      1        4           -                  1
Patrick     3        1           2                  1
Jean        -        2           -                  4
Sylvester   3        -           4                  3
Sylvester's estimated rating for Gusteau's
(terms in the order Daisy, Donald, Patrick, Jean)
= $\frac{0.21 \times 3 + 0.26 \times 4 + 0.26 \times 1 + 0.50 \times 2}{0.21 + 0.26 + 0.26 + 0.50}$
= 2.38
Here we are. Finally! Now we're able to fill the weights into the weighted average that will
rate Gusteau's for Sylvester.
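Putting distance, weight and weighted average together, the whole estimate fits in a few lines of Python. This is only a sketch: the names RESTAURANTS, taste_distance and estimate_rating, the dictionary layout and the use of None for a missing rating are all assumptions made for this illustration. With the sample table as printed (Jean rating Gusteau's with a 2), it arrives at roughly 2.38.

```python
import math

RESTAURANTS = ["Dorsia", "Gusteau's", "Planet Arraywood",
               "Half Moon Utility Restaurant"]

# One row per person, in RESTAURANTS order; None marks a missing rating.
ratings = {
    "Daisy":     [1, 3, 3, 0],
    "Donald":    [1, 4, None, 1],
    "Patrick":   [3, 1, 2, 1],
    "Jean":      [None, 2, None, 4],
    "Sylvester": [3, None, 4, 3],
}

def taste_distance(a, b):
    # Only restaurants rated by both people take part in the sum.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)
                         if x is not None and y is not None))

def estimate_rating(person, restaurant):
    """Weighted average of everyone else's rating for one restaurant."""
    col = RESTAURANTS.index(restaurant)
    weighted_sum = 0.0
    weight_total = 0.0
    for other, row in ratings.items():
        if other == person or row[col] is None:
            continue  # no rating: leave it out of BOTH sums
        w = 1 / (1 + taste_distance(ratings[person], row))
        weighted_sum += w * row[col]
        weight_total += w
    if weight_total == 0:
        return None  # nobody else rated this restaurant
    return weighted_sum / weight_total

print(round(estimate_rating("Sylvester", "Gusteau's"), 2))  # 2.38
```

The same function works for any other empty cell in the table, for example estimate_rating("Donald", "Planet Arraywood").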
Q: So if a rating is not known we just leave it
out?
A: Exactly. Just don’t use that dimension while
crunching the numbers. Don’t forget to leave out the
weight in the divisor’s sum if there’s no matching
rating, though!
Q: Seems like this will only work well if you
have lots of ratings for the same restaurants by
lots of people.
A: Spot on! The more ratings, the better the estimates.
But even with only this small sample of test data,
collective intelligence can already be quite useful.
Q: Tell me, won't people have privacy issues with
us gathering their ratings?
A: Not really. All we need is to know which ratings
belong together. We might anonymize everybody as mister
X, Y, Z. We just need some cookie or session to give
the recommendations to the right person.
Q: But how can we estimate ratings if a new
visitor hasn't rated anything yet?
A: There are a number of ways to handle this, from
translating like/dislike clicks into ratings to using a
plain, unweighted average to start with (bringing in weights as
soon as behavioral data is captured from the user).
Could you calculate an estimate for all
missing ratings in our test data?
Sylvester's estimated rating for Gusteau's is 2.38 (we already have this one)
Donald's estimated rating for Planet Arraywood is ____
Jean's estimated rating for Planet Arraywood is ____
Jean's estimated rating for Dorsia is ____
There’s more we didn’t handle in this audition, but definitely will address in the final
chapter.
1. Meat eaters probably will systematically underrate vegetarian restaurants. Some
people will have a slightly different bias in their ratings than others. This surely will
influence Euclidean distance, and hence recommendations. This problem can be
solved by using Pearson correlation instead of Euclidean distance.
2. What about the number of distances that will have to be calculated? Having 5000
users with their ratings forces us to calculate 5000 distances for every new user.
And then we wouldn't even have started calculating the weighted averages yet for
every restaurant that's in our database. What's more, each time someone rates an
additional restaurant, all calculated distance values for that person become
invalid immediately. These problems can be countered by item-based filtering (as
opposed to the user-based filtering we focused on). This would make for a nice
fireside chat between user-based and item-based filtering.
3. As an extra, people could rate restaurants on different topics (food, service, price,
wine, coziness, …). How would this influence our collaborative filtering?
4. And we haven't even started programming all of this. Perhaps algorithms
could/should be explained without the interference of discussing
implementation? So I decided to try this out and stay away from coding on this one.
If we do explain how to code all of this, should we do it in parallel while explaining the
algorithm, or after we have explained it?
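As a small taste of point 1 in the list above, here is a Python sketch of how Pearson correlation shrugs off a systematic bias that Euclidean distance trips over. The two rating lists are hypothetical, made up for this illustration only, and the function names pearson and euclidean are our own.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation of two equally long rating lists.

    Assumes neither list is constant (otherwise we would divide by zero).
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical ratings: the strict rater likes the same places,
# but scores every single one of them 2 points lower.
generous = [3, 5, 4, 2]
strict = [1, 3, 2, 0]

print(euclidean(generous, strict))          # 4.0
print(round(pearson(generous, strict), 2))  # 1.0
```

Euclidean distance sees two people who are 4.0 apart, while Pearson correlation reports a perfect match in taste, because it only looks at how the ratings move together, not at how high they are.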