algorithms by the dozen - monest.net


1 sorting things out

Better searching through sorting. Computers would be of no

practical use if we couldn’t retrieve all that data we have put into them. Finding and

retrieving information, just as in a non-digital world, can be done far more efficiently

when things are organized. Imagine looking up someone in a phone directory ordered

by phone number instead of by name.

Sorting them by back number

was a great idea. I never

could tell them apart from

behind.

Algorithms by the Dozen

(c) Dirk Vandycke - 2012


That’s true.

But since good wine tends to improve with

age, let’s sort them bottles by how old they

are.

Yeah … great idea …

… we could use the quicksort algorithm here … it being the fastest anyway

What do you mean?

We could do that … however …

… there might be a slightly easier and faster way to

do this. Imagine reframing the problem a bit,

looking at it as a counting problem instead of a

sorting problem.

We could count the bottles for every age we encounter. That way we are sorting them as well. Let’s find out how …

… meanwhile at Château

d’Inhéritance …


Now, counting sort, simply called count sort by its friends, is the algorithm we

want to have a look at here. Although it is a fully fledged, fast and easy to

implement way of ordering our data, its primary focus is actually on counting,

not on sorting. Hence its name. Let’s dive into the nitty gritty of counting sort.

Here’s what the steps of the algorithm look like:

1. First we have to list every age we might encounter, in ascending order.

2. Next, we’ll go over every wine bottle, asking it for its age.

3. While doing that, we’re going to keep a tally of each age we encounter. We’ll have to make sure we can keep records of this when we implement the algorithm in a programming language. But for now we can just count in fives, on a piece of paper.

4. When done, all that’s left is going over the list of ages, writing down every age that was counted at least once. Et voilà, we have a sorted list of all ages.

Giving us 0, 1, 2, 4, … (no 3!). That’s pretty

straightforward, isn’t it?

(oh, zero years old meaning less than one year)
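The four paper-and-pencil steps can be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name `AgeTally`, the method `distinctAges`, the `maxAge` parameter and the sample ages are all made up for this example and do not come from the book.

```java
import java.util.ArrayList;
import java.util.List;

final class AgeTally {
    // Returns every age that occurs at least once, in ascending order.
    // maxAge is an assumed upper bound on the ages we might encounter.
    static List<Integer> distinctAges(int[] ages, int maxAge) {
        int[] tally = new int[maxAge + 1];        // step 1: one counter per possible age
        for (int age : ages) {                    // step 2: ask every bottle for its age
            tally[age]++;                         // step 3: keep a tally
        }
        List<Integer> result = new ArrayList<>();
        for (int age = 0; age <= maxAge; age++) { // step 4: walk the ages in order
            if (tally[age] > 0) {
                result.add(age);                  // counted at least once
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ages = {4, 1, 0, 1, 2, 0, 4};       // hypothetical sample, no 3-year-olds
        System.out.println(distinctAges(ages, 4)); // prints [0, 1, 2, 4]
    }
}
```

Because each age is written down only once, this variant also shows how duplicates drop out as a side effect of counting.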


O-kay. And exactly how does all

this relate to sorting, again? As I

see it, we are just counting stuff!

Well, you’re right, we’re not quite there yet.

We now have to switch gears and implement the whole algorithm. For now it may seem as if we’re just holding the number of wine bottles for each individual age, which seems to have nothing to do with actually sorting our bottles according to age. But …

… our data IS sorted already!

Take a close look at our counting table. The ages we wanted to sort are written down in ascending order. So our wine ages will always be sorted while we are counting.


k      0   1   2   3   4
f(k)   5   8   1   0   3

A frequency table or histogram holds the

number of occurrences for each possible

value in your data.

A frequency table is like an array of counters, one for every value we encounter

in the data. Such tables are frequently used in statistics and even in artificial

intelligence. We certainly will see more of them in this book.
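A frequency table like the one above can be built and drawn as a crude text bar chart with a few lines of Java. This is a sketch, not the book's code: the names `FrequencyTable`, `frequencies` and `barChart` are assumptions, and the `#` bars are just one possible rendering.

```java
final class FrequencyTable {
    // Counts how often each value 0..maxValue occurs in the data.
    static int[] frequencies(int[] values, int maxValue) {
        int[] f = new int[maxValue + 1];   // Java zero-initializes int arrays
        for (int v : values) {
            f[v]++;
        }
        return f;
    }

    // Renders one text bar per value, e.g. "1 | ######## (8)".
    static String barChart(int[] f) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < f.length; k++) {
            sb.append(k).append(" | ");
            for (int i = 0; i < f[k]; i++) {
                sb.append('#');
            }
            sb.append(" (").append(f[k]).append(")\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] table = {5, 8, 1, 0, 3};     // the f(k) values from the table above
        System.out.print(barChart(table));
    }
}
```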

8 bottles of 1 year old

wine were counted here

Frequency tables very often get visually represented as bar charts.


Now that we know the counting sort algorithm, how about implementing it in a real programming language? Let’s go over the steps of the algorithm once more and see if we can explain this to Mr. JVM.

[Figure: the array of wine ages next to the frequency table with indices 0 to 99. The first bottle, being 4 years old, got counted at index 4.]

Here we assume that our original array just holds the ages of all wine bottles. How would this code change if we instead had an array of WineBottle objects to begin with?


We had four steps in our algorithm. Let’s go through them again, trying to write the corresponding code. Let’s start from an array of ages we will call wineAges.

1. First we had to list every age we might encounter, in ascending order. This translates into defining a frequency table and initializing all counters to zero. (Most modern languages, like Java, initialize all cells to zero automatically.)

int[] wineAgeFrequencies = new int[100];
for (int i = 0; i < wineAgeFrequencies.length; i++) wineAgeFrequencies[i] = 0;

2. Next, we were supposed to go over every wine bottle, asking it for its age. That sounds like a job for a for-loop.

3. And while doing that, we were going to keep a tally of each age we encounter, meaning we’ll have to increase the frequency for that age.

for (int i = 0; i < wineAges.length; i++) wineAgeFrequencies[wineAges[i]]++;

4. When done, all that’s left is going over the list of ages, writing down every age as many times as it was counted. We finally end up with a sorted list of all ages.

int k = 0;
for (int i = 0; i < wineAgeFrequencies.length; i++)
    for (int j = 0; j < wineAgeFrequencies[i]; j++) wineAges[k++] = i;

What’s with this k variable???
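Put together, the steps could look like the following self-contained sketch. The class name `CountingSortAges`, the `maxAge` parameter and the sample data are assumptions added for illustration; the loops themselves follow the fragments shown in the text.

```java
import java.util.Arrays;

final class CountingSortAges {
    // Sorts wineAges in place. Every age must lie in 0..maxAge (an assumed bound).
    static void sort(int[] wineAges, int maxAge) {
        // Step 1: the frequency table; Java already fills it with zeros.
        int[] wineAgeFrequencies = new int[maxAge + 1];

        // Steps 2 and 3: visit every bottle and tally its age.
        for (int i = 0; i < wineAges.length; i++) {
            wineAgeFrequencies[wineAges[i]]++;
        }

        // Step 4: write every age back as many times as it was counted.
        int k = 0; // next position to write to in wineAges
        for (int i = 0; i < wineAgeFrequencies.length; i++) {
            for (int j = 0; j < wineAgeFrequencies[i]; j++) {
                wineAges[k++] = i;
            }
        }
    }

    public static void main(String[] args) {
        int[] wineAges = {4, 1, 2, 4, 7, 4};   // made-up sample cellar
        sort(wineAges, 99);
        System.out.println(Arrays.toString(wineAges)); // prints [1, 2, 4, 4, 4, 7]
    }
}
```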


There are two bits of code in our implementation that we’ll have to look at carefully to fully grasp what they are doing. Let’s look at that code, up close.

First there’s the line counting each age’s frequency:

wineAgeFrequencies[wineAges[i]]++;

Which we, for clarity, could restate as

int wineAge = wineAges[i];
wineAgeFrequencies[wineAge]++;

Add one to the count of that particular age.

Secondly, we have the way the array is finally rewritten in ascending order.

int k = 0;
for (int i = 0; i < wineAgeFrequencies.length; i++)
    for (int j = 0; j < wineAgeFrequencies[i]; j++) wineAges[k++] = i;

For every value counted (i.e. for every different wine age) … as many times as we counted it (as given by its frequency) … write the value (age) down in the original array. We keep track of where we are writing in the array by a good old-fashioned counter, k. We increment the counter every time after it’s been used.


Q: Why do we need yet another sorting

algorithm? Aren’t we complicating

things?

First of all, because it’s one of the fastest

ways of sorting. But it’s also a very insightful

example of how reframing a problem can

lead to a more efficient solution.

As a nice bonus, this sorting algorithm can

easily remove duplicate values.

Last but not least, it can be implemented as a ‘stable’ sorting algorithm, unlike quicksort, for instance. Stable sorting will come in handy when we look at radix sort.

Q: How do we know it’s one of the

fastest ways of sorting?

Because, unlike quicksort, insertion sort and

merge sort, it’s not a comparison sort. Not a

single comparison between elements

occurs. It handles sorting by counting.

Hence its name: counting sort.

When sorting, you have to at least look at

your data once. Counting sort just runs over

all data, looking only once at each element.

This is also called sorting in linear time: the time needed to sort all values grows linearly with the number of values.

Q: Then why bother using the other

sorting algorithms?

Counting sort uses its values as indexes. So

values sorted have to be non-negative

integers (a technical constraint in a lot of

programming languages). There’s a

workaround to this by scaling and shifting

the range of values to a range of integers.
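The shifting part of that workaround might look like the sketch below. It is an assumption-laden illustration, not the book's code: the class name `ShiftedCountingSort` is made up, only shifting (not scaling) is shown, and the range between the smallest and largest value is assumed to be small enough to fit in an array.

```java
import java.util.Arrays;

final class ShiftedCountingSort {
    // Sorts any int values in place by shifting them so the smallest maps to index 0.
    static void sort(int[] values) {
        if (values.length == 0) {
            return;
        }
        int min = values[0];
        int max = values[0];
        for (int v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // Assumes max - min is small; a huge range would need a huge table.
        int[] frequencies = new int[max - min + 1];
        for (int v : values) {
            frequencies[v - min]++;        // shift: value min lands on index 0
        }
        int k = 0;
        for (int i = 0; i < frequencies.length; i++) {
            for (int j = 0; j < frequencies[i]; j++) {
                values[k++] = i + min;     // shift back when writing
            }
        }
    }

    public static void main(String[] args) {
        int[] temperatures = {-3, 5, 0, -3, 2};   // hypothetical data with negatives
        sort(temperatures);
        System.out.println(Arrays.toString(temperatures)); // prints [-3, -3, 0, 2, 5]
    }
}
```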

However, counting sort’s performance deteriorates quickly from linear time and is overtaken by quicksort as soon as the number of possible values (and with it the length of the frequency table) starts to outweigh the total number of values to be sorted (i.e. the length of the original data array).

Finally, counting sort isn’t an in-place sorting algorithm.
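As a rough summary of this trade-off, write n for the number of values to sort and k for the number of possible values (the length of the frequency table). Both symbols are introduced here for convenience and are not the book's notation.

```latex
T_{\text{counting sort}}(n, k) = O(n + k), \qquad \text{extra memory} = O(k)
```

Quicksort needs on the order of n log n comparisons on average, regardless of k, which is why a frequency table much longer than the data erases counting sort's advantage.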

Q: Aren’t we using more memory this

way?

Of course. We need a frequency table. But

the length of this array only depends on the

range of values. It’s independent of the

total number of values. As is very often the

case with algorithms, we can, to a certain

extent, exchange speed for memory. So

here we have a faster algorithm. More

memory is the price paid for this.

It also illustrates the relation between data

structures and algorithms. Better data

structures (in the form of an additional

frequency table, here) lead to less complex

(and often even faster) algorithms.


Wait just a minute. We are

actually sorting wine ages, not

wine bottles!

Well, you sure got us there

By sorting the wine ages we could reuse the same array, making it an in-place sorting algorithm.

But as soon as we start using objects, this solution simply isn’t going to cut it anymore.

We don’t like to admit it, but sorting objects with counting sort is even tougher than it seems. But we can still pull this off!


Let’s go over our code dealing with this new array of WineBottle objects and see what needs to change. Since step 1 doesn’t get affected we’ll take off at step 2 (and step 3).

for (int i = 0; i < wineCellar.length; i++) wineAgeFrequencies[wineCellar[i].getAge()]++;

So we’re off the hook, right? Unfortunately it’s not that easy. Look at step 4 again.

int k = 0; for (int i = 0; i < wineAgeFrequencies.length; i++)

for (int j = 0; j < wineAgeFrequencies[i]; j++) wineAges[k++] = i;

The first problem here is wineAges. We should change that to wineCellar.

However that’s when our biggest problem pops up and destroys everything we worked

for. We can’t put ages where bottles are expected. What’s more, just by knowing the

order of the ages how can we figure out where bottles will belong in their sorted

order?

incompatible types
required: WineBottle
found: int
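The audition chapter stops before resolving this, so the following is not the author's solution. It is a sketch of one commonly used approach: turn the frequencies into starting positions (running totals) and copy each bottle into a second array. The nested `WineBottle` class, its fields, and the names `BottleCountingSort` and `sortByAge` are hypothetical; only `getAge()` appears in the text.

```java
final class BottleCountingSort {
    // Hypothetical stand-in for the book's WineBottle class.
    static final class WineBottle {
        private final String name;
        private final int age;

        WineBottle(String name, int age) {
            this.name = name;
            this.age = age;
        }

        int getAge() { return age; }
        String getName() { return name; }
    }

    // Returns a new array with the bottles ordered by age (0..maxAge).
    // Bottles of equal age keep their original relative order (stable).
    static WineBottle[] sortByAge(WineBottle[] wineCellar, int maxAge) {
        int[] start = new int[maxAge + 2];
        for (WineBottle bottle : wineCellar) {
            start[bottle.getAge() + 1]++;          // count, shifted one slot right
        }
        for (int age = 0; age <= maxAge; age++) {
            start[age + 1] += start[age];          // start[age] = bottles younger than age
        }
        WineBottle[] sorted = new WineBottle[wineCellar.length];
        for (WineBottle bottle : wineCellar) {
            sorted[start[bottle.getAge()]++] = bottle; // place, then advance that age's slot
        }
        return sorted;
    }

    public static void main(String[] args) {
        WineBottle[] wineCellar = {
            new WineBottle("A", 4), new WineBottle("B", 1),
            new WineBottle("C", 4), new WineBottle("D", 0),
        };
        for (WineBottle bottle : sortByAge(wineCellar, 99)) {
            System.out.println(bottle.getName() + " (" + bottle.getAge() + ")");
        }
    }
}
```

Note that this version needs a second array of bottles, so it trades the in-place property for the ability to move whole objects.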


will counting sort

ever be able to keep

up with its

promises?

watch for the upcoming new

title in the Head First series

HF Algorithms &

Data Structures

2 capturing taste

Recommendation is everywhere. Recommendations and ads are

issued billions of times a day on the net. Recommending movies, books, e-commerce

products and services … and even people. Want to know how Google, Amazon and

dating sites are making highly accurate suggestions to their customers? Looking to

write your own recommender system? Is a customer going to buy your new product?

Diving into this chapter will get you those answers.

Oh my. What a perfect

match we are. Thanks to the

collaborative filtering on the

dating site we were using.

Coming highly recommended


The union of food critics told their

members to lay aside work because of

the growing number of places to cover.

But hiring more of them to reduce

their load, really would be too

expensive.

Over the years, the “Epicure Restaurant Guide” has made quite a name for itself as

the leading source for gastronomic advice with the most complete coverage

available on places to eat, worldwide. Their critics visit thousands of restaurants

daily, reporting their experiences through the famous guide.

To cope with the fact that there’s only one print run of the guide a year, the Epicure company has recently established a web presence to put up-to-date review reports online. They also enabled users to upload their personal reviews and ratings. The sky seemed to be the limit.

But there’s a cloud in this bright sky …

I was told that all those ratings we’re gathering from our users could be turned into useful recommendations. But I’m a bit skeptical about the whole idea. Perhaps you can help us out?

and health concerns!


            Dorsia   Gusteau's   Planet Arraywood   Half Moon Utility Restaurant
Daisy       1        3           3                  0
Donald      1        4           -                  1
Patrick     3        1           2                  1
Jean        -        2           -                  4
Sylvester   3        -           4                  3

We’ve already taken a look at the website for

you. It seems visitors can rate restaurants

on a scale from 0 to 4 stars.

They also can enter data of new restaurants

that aren’t reviewed by Epicure’s experts yet.

We already went ahead of ourselves and

asked for a sample data set of ratings from

Epicure. And they just got back to us.

Think about this before you turn the page.

Which restaurant would you recommend to

someone else from this sample? Why?


Seems pretty straightforward to me. We should figure out each restaurant’s average rating. Then we’d be able to recommend the restaurant with the highest average score first. Like this ...

          Dorsia   Gusteau's   Planet Arraywood   Half Moon Utility Restaurant
          1        3           3                  0
          1        4           -                  1
          3        1           2                  1
          -        2           -                  4
          3        -           4                  3
Average   2        2.5         3                  1.8

So ‘Planet Arraywood’ definitely would be the best recommendation to go with in this case, with Gusteau’s being runner-up.
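The averages can be checked with a small Java sketch. Everything named here is an assumption for illustration: the class `AverageRatings`, the `NO_RATING` marker of -1 standing in for the "-" entries, and the column order (Dorsia, Gusteau's, Planet Arraywood, Half Moon) taken from the sample table.

```java
final class AverageRatings {
    static final int NO_RATING = -1;   // assumed marker for a "-" in the sample table

    // Rows: Daisy, Donald, Patrick, Jean, Sylvester.
    // Columns: Dorsia, Gusteau's, Planet Arraywood, Half Moon.
    static final int[][] SAMPLE = {
        {1, 3, 3, 0},
        {1, 4, NO_RATING, 1},
        {3, 1, 2, 1},
        {NO_RATING, 2, NO_RATING, 4},
        {3, NO_RATING, 4, 3},
    };

    // Simple average of all known ratings for one restaurant (column).
    static double average(int[][] ratings, int restaurant) {
        int sum = 0;
        int count = 0;
        for (int[] person : ratings) {
            if (person[restaurant] != NO_RATING) {
                sum += person[restaurant];
                count++;
            }
        }
        return count == 0 ? Double.NaN : (double) sum / count;
    }

    public static void main(String[] args) {
        for (int restaurant = 0; restaurant < 4; restaurant++) {
            System.out.println(average(SAMPLE, restaurant));
        }
    }
}
```

Missing ratings are skipped rather than counted as zero, which is what makes Planet Arraywood's three ratings average out to 3.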


Sad but true

Average ratings only account for the ratings, not for

where they came from. You may have noticed that we

didn’t use all of the data we were given.

So recommendations will end up being exactly the

same for everyone. They will never be personalized.

Which is too bad, because restaurants are all about

taste. And taste is very personal.

Wait just a minute there! That way, every

visitor gets to see the same recommendation?

I visited ‘Planet Arraywood’ and it’s not uhm …

exactly my ‘taste’.

Could you come up with some disadvantages of using the

average rating of restaurants to make recommendations?

Everyone gets the same recommendation


We need to include taste

When we average ratings, we are, in fact,

averaging people’s opinions. Someone giving

a high score to a restaurant, definitely must

like that restaurant more than someone giving

it a lower rating.

So getting an average rating of people who

have visited the restaurant, like we did, still

seems like a good start. The problem with a

simple average is that everyone’s taste is

being put on the same footing.

But people can get better recommendations

from people with a similar taste to theirs. So

we are going to put a higher weight on ratings

from people having a closer taste. What we

need is a weighted average.

Could you come up with some disadvantages of using the average rating of restaurants to make recommendations?

Everyone gets the same recommendation
a new restaurant might not have ratings yet, even if it’s good
good on average doesn’t have to mean good for everyone
an average rating doesn’t take into account who gave the ratings
ratings are subject to taste, while an average is purely objective
every rating just has the same ‘impact’ on the average
the best restaurant on average may get very crowded

So, averages won’t cut

it. I would love to see

you do better.


Head First: Welcome, Mr. Average. How are you?

Weighted Average: Fine, thank you. But I do prefer

you call me Weighted Average, if you don’t mind. I

really do come from a large family. And my very

popular cousin Simple Average very often seems to get

all the credit for being the (only) average.

Head First: Of course. Can you explain to our readers

why you’re so different?

Weighted Average: Sure. My cousin just adds all

values and divides that sum by the number of values.

Head First: And that’s not what you do?

Weighted Average: Well he gives every value the

same weight, while I’m allowing for different weights.

The values with a larger weight will have a greater

impact on the average, pulling it more towards them. I

take a piece of every value, but not necessarily an equal

piece.

Head First: You mean that some values are more

important to you than others?

Weighted Average: Exactly.

Head First: Why is that?

Weighted Average: Well, that depends on the

context. My cousin only cares about the values.

Head First: I see. So what about the context that our

readers are involved with here, rating restaurants?

Weighted Average: In that case, ratings from people

sharing your ‘opinion’ about restaurants you visited

might be more interesting to you.

Head First: Why wouldn’t a simple average do?

Weighted Average: Oh come on. A simple average

would be the same to everyone. That’s not using

context, only content (ed. the ratings). Besides very

extreme but rare ratings could mess up the whole

simple average. And my 4 doesn’t have to be your 4. We

can have different biases (ed. calibrations).

Head First: What do you mean?

Weighted Average: It’s because of your ‘taste

similarity’ to some people that their ratings may be

more important to you than those of others with less

similar taste to yours. Why go for a generic average

rating when you could go for a rating tailored to your

particular preferences.

Head First: So if people rate restaurants the same way

I do, their ratings for restaurants I didn’t visit would be

more useful to me?

Weighted Average: Right on.

Head First: Really?

Weighted Average: I’m not pulling this out of thin air

here. I got a lot more popular with the emergence of

social media, you know. Google, Amazon, Ebay … all use

my services. Even offline retailers do so when they

make personalized coupons. Why send shaving cream

coupons to a five year old, right? You’d better throw in

a computer game coupon instead.

Head First: That makes sense. I guess we’ll be seeing

more of you. Thank you very much for the interview.

Weighted Average: Don’t mention it. If you stay

around, you can see me on stage in a few pages.


Before we dive into the specifics of a weighted average, let’s go over the big picture here and set up a road map for what we’re about to do.

1. … to recommend restaurants to someone, we will …

2. … try to estimate his/her missing ratings … (for restaurants (s)he didn’t visit)

3. … by averaging all known ratings … (for that restaurant)

4. … but weighing them by taste similarity, based on the difference in ratings between recommendee and rater … (for the restaurants they both visited)

So, for instance, the 3 Daisy gave Gusteau’s would get a weight based on the similarity between her and Sylvester. This ‘similarity’ will be derived from the ratings of all restaurants they both did rate.


To keep focus, let’s assume we want to estimate Sylvester’s rating for Gusteau’s. That way we will be able to give Sylvester a predicted rating for Gusteau’s. If we needed to estimate more than one rating for one person, as would be the case for Jean, we could order the restaurants he didn’t visit by those estimates. But let’s do this for Sylvester first. He didn’t rate Gusteau’s. So what would Sylvester probably rate Gusteau’s if he went there?

We’ll have to average all ratings for Gusteau’s:

3 4 1 2

Daisy gave Gusteau’s a 3. So we’ll weigh her rating according to the similarity between her ratings for other restaurants and Sylvester’s ratings for those other restaurants.

We’ll do the same for the 4 from Donald. Its weight will depend on the similarity of ratings given to all restaurants by Donald and Sylvester (there are 2: Dorsia and the Half Moon).

And the ratings for all restaurants from both Patrick and Sylvester will decide on the weight Patrick’s 1 will get.

Finally, Jean’s 2 will be weighted based on the ratings of all restaurants Jean and Sylvester visited (there’s only one: the Half Moon restaurant).


A weighted average adds all values to be averaged, after multiplying them first with

some weight. Finally this sum is divided by the sum of all weights used.

Quite a definition. Let’s go through it and see how to calculate a weighted average for

the ratings of Gusteau’s in our sample set.

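$$\frac{w_{\text{Daisy}} \times 3 + w_{\text{Donald}} \times 4 + w_{\text{Patrick}} \times 1 + w_{\text{Jean}} \times 2}{w_{\text{Daisy}} + w_{\text{Donald}} + w_{\text{Patrick}} + w_{\text{Jean}}}$$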

We don’t have weights, do we?

Where will we get them?

Be patient. We’ll get to this next.

Since we have ratings, we have taste. By comparing

people’s taste we’ll be able to figure out how close their

taste really is. The closer people are in taste to the

person getting a recommendation, the higher we’ll

weigh their taste.


The higher a value’s weight, the more the

average will lean towards that value, giving it

more ‘importance’. Take, for example, the

simple average of 5 and 7. It’s 6, which is

situated exactly in the middle between 5 and 7.

Suppose we want to give 7 triple the weight of

5, pulling the weighted average towards 7,

giving 7 a greater influence on the average.

Three times as much as 5, to be exact.

The new average, 6.5, is now 3 times as far from

5 as it is from 7.

$$\frac{5 + 7}{2} = 6$$

$$\frac{1 \times 5 + 3 \times 7}{1 + 3} = 6.5$$
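The same calculation can be written as a small Java method. This is a minimal sketch: the class name `WeightedAverage` and the method signature are assumptions, and the weights are simply passed in, since where they come from is a separate question.

```java
final class WeightedAverage {
    // Multiplies every value by its weight, sums the products,
    // and divides by the sum of all weights used.
    static double weightedAverage(double[] values, double[] weights) {
        if (values.length != weights.length) {
            throw new IllegalArgumentException("one weight per value is required");
        }
        double weightedSum = 0;
        double weightSum = 0;
        for (int i = 0; i < values.length; i++) {
            weightedSum += weights[i] * values[i];
            weightSum += weights[i];
        }
        return weightedSum / weightSum;
    }

    public static void main(String[] args) {
        double[] values = {5, 7};
        // Equal weights give the simple average.
        System.out.println(weightedAverage(values, new double[]{1, 1})); // prints 6.0
        // Giving 7 triple the weight of 5 pulls the average towards 7.
        System.out.println(weightedAverage(values, new double[]{1, 3})); // prints 6.5
    }
}
```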

Suppose 6 people gave one restaurant a simple average score of 2, while 4 people gave that same restaurant only a 1 and 5 people gave it a 4. Could you figure out the total average score by using a weighted average?

HINT: you should obviously make a weighted average of the scores 2, 1 and 4.

but you gotta figure out the weights.


Q: Where did the weights in the 5 and 7 example come from?

A: The exact weights don’t really matter, as long as the weight of 7 is three times that of 5. Using 3 and 9 or 15 and 45 as weights will get you the exact same average.

$$\frac{3 \times 5 + 9 \times 7}{3 + 9} = 6.5$$

$$\frac{15 \times 5 + 45 \times 7}{15 + 45} = 6.5$$

Brain Barbell: In fact, as long as you triple the weight w of 5 to go with 7, the equation always simplifies back to the weights 1 and 3.

$$\frac{1w \times 5 + 3w \times 7}{1w + 3w} = 6.5$$

Suppose 6 people gave one restaurant a simple average score of 2, while 4 people gave that same restaurant only a 1 and 5 people gave it a 4. Could you figure out the total average score by using a weighted average?

$$\frac{6 \times 2 + 4 \times 1 + 5 \times 4}{6 + 4 + 5} = 2.4$$


Q: Could you please tell me again why we can’t

just list all restaurants with their average

ratings by experts and users?

A: Sure. The whole point is to get customized ratings.

Otherwise, averages would be the same to anyone,

regardless of their personal taste. You want a higher

weight put on ratings of people whose taste lines up

with yours. That’s why we’ll use weighted averages.

Q: But why do recommendations have to be

personalized? Isn’t a good restaurant just a

good restaurant, period.

A: Imagine you loving Italian food while getting ratings

of someone who hates Italian food. Do you think it

would be useful advice? What’s more, a fixed number

of experts can’t give as diverse an opinion as

hundreds of users.

Q: Why not?

A: A large community of users can cover far more

restaurants than any fixed number of experts can.

Also, there will be more ratings on each individual

restaurant. Besides expert ratings are only personal to

them. The whole idea is called collaborative filtering

and is heavily used by Internet moguls like Google,

Amazon, YouTube, Flickr, Facebook, LinkedIn … and

part of their success. Those sites behave on context as

well as content, when being searched. And the user is

just part of that context.

Q: I’m sorry. colla-what?

A: Collective Intelligence is about leveraging the

imperfect knowledge of lots of people to get smart

decisions. Collaborative filtering is a subdomain

focused on recommending systems and smart filtering

of data.

Q: I see. So it’s all mainly about personalized

recommendations. But any expert could give

personalized advice, no?

A: We wouldn’t be so sure of that. There’s a lot of

experimental data pointing in the direction of

collective intelligence giving better solutions than any

single expert.

Q: How can that be?

A: It’s a bit outside the scope of this book, but let’s

leave it at this: if lots of people err on both sides (for

better and for worse), then their errors will tend to

cancel out.

Q: Woah, a weighted average is totally

different from a regular average, right?

A: A simple average is actually a special case of

weighted average in which all weights are equal. But

since they’re not, in the case of a weighted average,

ratings are customized by the weights.

Q: On what will the weights for our ratings be

based? There’s no such thing as weights

available in the data we got.

A: People having similar taste to the user looking for a recommendation will see their rating given more weight in the average. So we will calculate the weights based on the taste similarity of people. We’ll start tackling this problem on the very next page.


What we do know:

1. In order to give someone restaurant recommendations, we will try to estimate the ratings she would give restaurants she didn’t visit (yet). When looking at our sample dataset, that would be the case for Sylvester. What rating would he give Gusteau’s?

2. To estimate Sylvester’s rating for Gusteau’s, we will calculate a weighted average of

all of Gusteau’s ratings.

What we don’t know:

How much weight to put on each rating.

What we are going to do:

Put more weight on ratings from people who are closer to Sylvester’s taste, while putting lower weight on ratings from people whose taste is less similar to Sylvester’s.


Well, almost …

People can get better recommendations from someone with similar taste to theirs.

So what we are going to do is try to figure out how much the tastes of different people are alike, instead of trying to define one person’s taste.

Once we have a taste “distance” between people, we can take all those distances, together with their ratings, into account to distill one average rating.

We have ratings already

All we have to do is compare people’s ratings of restaurants to discover how close their preferences actually are.

Well that kinda narrows it down, doesn’t it.

Let’s just get our universal taste formula

out and let’s get going, right?


[Figure: two rating axes, each running from 0 to 4]

To turn individual ratings into something as useful as taste similarity between people

giving ratings and receiving recommendations, we need to figure out where each taste is

situated amongst all others. If we just could put ratings on an axis or a scale, we could use

regular “distance” as our measure. This is in fact perfectly possible.

Restaurant rating axis and people points

Imagine we put the ratings for any two restaurants along as many different axes. Each point in such a system would then represent a pair of ratings, one for each restaurant. Those could be pairs of ratings belonging to one person, making a point in this system represent a person.

Could you put all the other

people from the sample

dataset in this picture?

or more, as we will soon see, but let’s go slow for now!


[Figure: people plotted as points on two rating axes, each running from 0 to 4]

Now that we have points representing people’s tastes, the distance between those

points becomes their taste distance. That’s exactly what we were looking for.

We added two more people to the system to make things clear.

Daisy & Donald have

the same rating/taste

regarding Dorsia, a

meager 1

Daisy

Any idea on how to calculate the

distance between Patrick and

Donald?


I get the part where we’ll give

ratings from people ‘more similar’

to the recommendee a higher

weight. But could you back up a

bit as to why we need distance?

Well, to be honest, any measurement trying to account for how close

people’s ratings are, would do. And in fact, very often linear regression

is used to accomplish this instead of simple Euclidean distance.

But this would really take us too far from the overall picture right now. And since Euclidean distance is a very good and, above all, easy way to measure ‘proximity’ or ‘closeness’, we’ll use it to explain the mechanics here.

Once you get the big picture with something as easy as a simple distance formula, you can just plug in any measurement you can think of to assess people’s ‘distance in taste’.

But do not underestimate Euclidean distance. It can give great results

in practice, just as well!

(Euclidean) distance is the first

thing that comes to mind if we

want to measure how close

things are to each other.


$$c = \sqrt{(3-1)^2 + (4-1)^2} = \sqrt{13} \approx 3.6$$

You probably saw this in your math classes

with a formula, somewhere along this line:

c = 5

However easy it is reading the distance between Daisy and Donald from one axis, we do need

a general way to calculate the distance between two points. This is where Pythagoras’

theorem comes into play.

Pythagoras to the rescue

All we have to do is take the differences in

ratings from two people for each restaurant,

square them, add those squared differences

together, and finally take the square root of

this sum.

$$C = \sqrt{A^2 + B^2}$$


Wait a minute. Why do

we only mention two

restaurants? There are

4 of them!

Well, we wanted to start off explaining the whole distance story as simply as possible. With just two dimensions, we were able to actually visualize the taste space. With 3 or more dimensions (one for each restaurant), we can’t visualize this any longer. The good news is we can still use the same formula, no matter how many dimensions we end up having.

Let’s do this once more for the

distance between Patrick and Donald,

but this time with all dimensions.

1. For each restaurant, take the difference between the ratings: $1 - 3$, $4 - 1$, $?$, $1 - 1$

2. Square those differences: $(1 - 3)^2$, $(4 - 1)^2$, $(?)^2$, $(1 - 1)^2$

3. Add the squared differences: $(1 - 3)^2 + (4 - 1)^2 + (?)^2 + (1 - 1)^2$

4. Take the square root of this sum: $\sqrt{(1 - 3)^2 + (4 - 1)^2 + (?)^2 + (1 - 1)^2}$


And exactly where did all those dimensions come from? We had a nice formula for distance in a 2D plane. But now we’re suddenly talking 4 dimensions here?

I see. Well ... then perhaps you could enlighten us with the general formula?

Ok, take a deep breath, while we go over this slowly.

We started off representing each restaurant on its own axis, setting

out people’s ratings along them. So one person’s overall taste gets

represented by one point. That way, the distance between people

in that axis system becomes an indication of how close their taste

is.

We first took on only 2 restaurants to introduce the concept. After all, we can easily visualize two dimensions on a book page. Heck, using perspective we could even try to simulate three dimensions. But in reality (as in our sample) people rate far more than two or three restaurants. Of course it’s impossible to keep visualizing that. But we don’t have to. Mathematics tells us we can still use the same formula no matter how many dimensions we have.

Sure, let’s take a new page for this.


What mathematics tells us …

To calculate the distance between two points in an n-dimensional space …

1. subtract all matching coordinates of both points
2. square each difference
3. add all squared differences into one sum
4. and finally take the square root of that sum

… how we use it in our context

To calculate the taste-distance between two people having both rated the same n restaurants …

1. for each restaurant, subtract the matching ratings given by both people
2. square those differences
3. add all squared differences into one sum
4. and finally take that sum’s square root

To calculate the taste-distance between two people, we are going to use the general

formula for Euclidean distance. The coordinates we are using are the ratings for the

restaurants we have from those two people.
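As an illustration, the general formula can be sketched in a few lines of Python. The function name `euclidean_distance` and the tuple representation of ratings are assumptions made for this sketch, not something the chapter prescribes. It only covers the case where both people rated every restaurant.

```python
import math

def euclidean_distance(a, b):
    """Distance between two points given as equal-length sequences of coordinates."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of coordinates")
    # subtract matching coordinates, square each difference, add them all up
    total = sum((x - y) ** 2 for x, y in zip(a, b))
    # and finally take the square root of that sum
    return math.sqrt(total)

# Patrick and Donald, using only their ratings for Dorsia and Gusteau's
patrick = (3, 1)
donald = (1, 4)
print(round(euclidean_distance(patrick, donald), 2))  # 3.61
```

The same function works unchanged for 3, 4 or 400 restaurants, since it simply walks over however many coordinates it is given.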


Q: Wait a minute, for the distance calculation on page 20, you first subtract Donald’s rating for Dorsia from Patrick’s, but it gets done the other way around for their ratings of Gusteau’s. What’s going on?

A: Very subtle thing to notice. It doesn’t really matter

as the result gets squared anyway, so the sign always

becomes positive after squaring the difference.

Q: Ok, but why do the other restaurants come into play here? We originally were trying to estimate Sylvester’s rating for Gusteau’s.

A: That is correct. However, the more restaurants two people have rated, the more accurately the distance between them can be determined. We might have 99 restaurants we agree upon but 1 where we sharply disagree. This would make our taste very similar. But not if we accidentally only looked at the difference between our ratings on the one we disagreed upon.

Q: What happens if there’s no rating from a

person for a certain restaurant?

A: Just don’t use that dimension in the distance calculation. Notice that this is not the same as using zero for any missing ratings! Another possibility would be to use the estimated ratings we will eventually be able to calculate.

Q: This Euclidean distance seems quite

important for measuring differences in taste,

doesn’t it?

A: Well it sure is a great way to introduce these

concepts. But it really isn’t the only distance measure

we could have used. There are a lot more of them all

with their own intricacies and advantages. There’s

nearest neighbor, vector cosine, Pearson correlation,

Jaccard coefficient, Hamming distance, Manhattan

distance, and so on …

Since we’ll need them as weights for calculating

Sylvester’s rating on Gusteau’s, could you calculate

the taste distances from Sylvester to everyone else?


Let’s quickly go over what we’ve been up to until this point, one more time.

1. We want to make recommendations to Sylvester. This was translated into estimating a rating for restaurants he hasn’t visited, which is only Gusteau’s in his case.
2. To get such a rating for Gusteau’s customized to Sylvester, we want to calculate a weighted average of all Gusteau’s ratings.
3. The weight each person’s rating gets depends on how similar this person’s taste is to Sylvester’s.
4. Taste distance follows from putting people according to their ratings in taste space, where each axis represents the ratings for a particular restaurant. Since we can only visually represent two dimensions (restaurants), we prefer using the common formula for calculating distances immediately.

$\sqrt{(1-3)^2 + (3-?)^2 + (3-4)^2 + (0-3)^2}$



First of all, let’s grab a copy of our sample dataset.

Now, let’s take on the distance between Daisy and Sylvester for starters. Any term with a missing rating (?) drops out, so let’s forget about those:

$\sqrt{(1-3)^2 + (3-?)^2 + (3-4)^2 + (0-3)^2} = 3.74$

Next, we’ll do the distance between Donald and Sylvester:

$\sqrt{(1-3)^2 + (4-?)^2 + (?-4)^2 + (1-3)^2} = 2.83$

Next, we have the distance between Patrick and Sylvester:

$\sqrt{(3-3)^2 + (1-?)^2 + (2-4)^2 + (1-3)^2} = 2.83$

Finally, there’s the distance between Jean and Sylvester left:

$\sqrt{(?-3)^2 + (2-?)^2 + (?-4)^2 + (4-3)^2} = 1.00$

If we need taste distances as weights for our weighted average rating of Gusteau’s for Sylvester, we’d better start figuring them out at this point.
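The four distances to Sylvester can be double-checked with a short Python sketch. The `ratings` dictionary, the use of `None` for a missing rating and the function name `taste_distance` are choices made for this illustration only. A dimension is skipped whenever either person has no rating for it.

```python
import math

# Ratings in the order Dorsia, Gusteau's, Planet Arraywood, Half Moon Utility Restaurant.
# None marks a missing rating (a representation chosen for this sketch).
ratings = {
    "Daisy":     (1, 3, 3, 0),
    "Donald":    (1, 4, None, 1),
    "Patrick":   (3, 1, 2, 1),
    "Jean":      (None, 2, None, 4),
    "Sylvester": (3, None, 4, 3),
}

def taste_distance(a, b):
    """Euclidean distance over the restaurants both people have rated."""
    # keep only the dimensions where neither rating is missing
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

for name in ("Daisy", "Donald", "Patrick", "Jean"):
    d = taste_distance(ratings[name], ratings["Sylvester"])
    print(f"{name}: {d:.2f}")
# Daisy: 3.74
# Donald: 2.83
# Patrick: 2.83
# Jean: 1.00
```

Skipping a dimension is deliberately different from filling in a zero: a zero would count as a (very low) rating and push people apart.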


I really hate to spoil the fun again. But the closer people’s tastes are, the shorter the distance between them gets. I would love to see a BIGGER number for people who are closer together in taste.

Sharp! But that’s not that big an issue. You

could simply invert the distance. Making

big what is small and vice versa.

But to avoid the risk of division by zero,

we’ll first add 1 to anything we invert.

So if Daisy’s distance to Sylvester is 3.74,

we’ll use 1/(1+3.74) as weight for Daisy’s

rating. That would give us 0.21 as Daisy’s

rating’s weight.

Weights: Donald 0.26, Daisy 0.21, Jean 0.50, Patrick 0.26
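The inversion trick is small enough to sketch in Python. The function name `weight_from_distance` and the `distances` dictionary are made up for this illustration; the distances themselves are the rounded values calculated earlier.

```python
def weight_from_distance(distance):
    """Turn a taste distance into a weight: small distances give big weights."""
    # adding 1 before inverting avoids dividing by zero for identical tastes
    return 1 / (1 + distance)

# rounded taste distances to Sylvester
distances = {"Daisy": 3.74, "Donald": 2.83, "Patrick": 2.83, "Jean": 1.00}
for name, d in distances.items():
    print(f"{name}: {weight_from_distance(d):.2f}")
# Daisy: 0.21
# Donald: 0.26
# Patrick: 0.26
# Jean: 0.50
```

A nice side effect of adding 1 is that the weight always ends up between 0 and 1, with 1 reserved for people whose ratings are identical.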


            Dorsia   Gusteau's   Planet Arraywood   Half Moon Utility Restaurant
Daisy       1        3           3                  0
Donald      1        4           -                  1
Patrick     3        1           2                  1
Jean        -        2           -                  4
Sylvester   3        -           4                  3

Sylvester’s estimated rating for Gusteau’s:

$\frac{0.26 \times 4 + 0.21 \times 3 + 0.50 \times 2 + 0.26 \times 1}{0.26 + 0.21 + 0.50 + 0.26} = \frac{2.93}{1.23} \approx 2.38$

Here we are. Finally! Now we’re able to fill the weights into the weighted average that will rate Gusteau’s for Sylvester.
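A weighted average like this one can be sketched in Python as follows. The function name `weighted_average` and the two dictionaries are assumptions for this example; the weights are the rounded values derived from the taste distances to Sylvester, and the ratings are everyone’s ratings for Gusteau’s from the sample dataset.

```python
def weighted_average(ratings, weights):
    """Weighted average of the ratings that exist; missing ratings (None) are skipped."""
    numerator = 0.0
    divisor = 0.0
    for name, rating in ratings.items():
        if rating is None:
            continue  # no rating: leave this weight out of the divisor as well
        numerator += weights[name] * rating
        divisor += weights[name]
    if divisor == 0:
        raise ValueError("nobody with a known weight has rated this restaurant")
    return numerator / divisor

gusteaus = {"Daisy": 3, "Donald": 4, "Patrick": 1, "Jean": 2}
weights = {"Daisy": 0.21, "Donald": 0.26, "Patrick": 0.26, "Jean": 0.50}
print(round(weighted_average(gusteaus, weights), 2))  # 2.38
```

Dividing by the sum of the weights keeps the estimate on the same scale as the original ratings, whatever the weights happen to add up to.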


Q: So if a rating is not known we just leave it

out?

A: Exactly. Just don’t use that dimension while

crunching the numbers. Don’t forget to leave out the

weight in the divisor’s sum if there’s no matching

rating, though!

Q: Seems like this will only work well if you

have lots of ratings for the same restaurants by

lots of people.

A: Spot on! The more ratings, the better the estimates. But even with only this small sample of test data, collective intelligence can already be quite useful.

Q: Tell me, won’t people have privacy issues with

gathering their ratings?

A: Not really, all we need is the ratings belonging to

someone. We might anonymize everybody as mister

X, Y, Z. We just need some cookie or session to give

the recommendations to the right person.

Q: But how can we estimate ratings if a new visitor hasn’t rated anything yet?

A: There are a number of ways to handle this. From

translating like/dislike clicks into ratings to using a

normal average to start with (replacing weights as

soon as behavioral data is captured from the user).

Could you calculate an estimate for all missing ratings in our test data?

Sylvester’s estimated rating for Gusteau’s is 2.38 (we already have this one)

Donald’s estimated rating for Planet Arraywood is

Jean’s estimated rating for Planet Arraywood is

Jean’s estimated rating for Dorsia is
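For anyone who would rather check their answers with a program, here is one possible end-to-end sketch in Python. Everything named in it (`RESTAURANTS`, `RATINGS`, `taste_distance`, `estimate_rating`, `missing_ratings`) is invented for this illustration. It also assumes that people who share no rated restaurant with the target person are simply left out, a case the text does not cover.

```python
import math

RESTAURANTS = ("Dorsia", "Gusteau's", "Planet Arraywood", "Half Moon Utility Restaurant")

# One row per person, in RESTAURANTS order; None marks a missing rating.
RATINGS = {
    "Daisy":     (1, 3, 3, 0),
    "Donald":    (1, 4, None, 1),
    "Patrick":   (3, 1, 2, 1),
    "Jean":      (None, 2, None, 4),
    "Sylvester": (3, None, 4, 3),
}

def taste_distance(a, b):
    """Euclidean distance over shared restaurants, or None if there is no overlap."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def estimate_rating(person, restaurant):
    """Weighted average of other people's ratings, weighted by 1 / (1 + distance)."""
    col = RESTAURANTS.index(restaurant)
    numerator = divisor = 0.0
    for other, row in RATINGS.items():
        if other == person or row[col] is None:
            continue  # skip the person themselves and anyone without a rating
        distance = taste_distance(RATINGS[person], row)
        if distance is None:
            continue  # assumption: no shared restaurants means no usable weight
        weight = 1 / (1 + distance)
        numerator += weight * row[col]
        divisor += weight
    return numerator / divisor if divisor else None

def missing_ratings():
    """Yield every (person, restaurant) pair that has no rating yet."""
    for person, row in RATINGS.items():
        for col, rating in enumerate(row):
            if rating is None:
                yield person, RESTAURANTS[col]

for person, restaurant in missing_ratings():
    print(f"{person} / {restaurant}: {estimate_rating(person, restaurant):.2f}")
```

Because this sketch works with unrounded distances and weights, its results can differ slightly in the last digit from a calculation done by hand with rounded numbers.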


There’s more we didn’t handle in this audition, but definitely will address in the final

chapter.

1. Meat eaters probably will systematically underrate vegetarian restaurants. Some

people will have a slightly different bias in their ratings than others. This surely will

influence Euclidean distance, and hence recommendations. This problem can be

solved by using Pearson correlation instead of Euclidean distance.

2. What about the number of distances that will have to be calculated? Having 5000 users with their ratings forces us to calculate 5000 distances for every new user. And then we wouldn’t even have started calculating the weighted averages yet for every restaurant that’s in our database. What’s more, each time someone rates an additional restaurant, that would make all calculated distance values for that person invalid immediately. These problems can be countered by item-based filtering (as opposed to the user-based filtering we focused on). This would make for a nice fireside chat between user-based and item-based filtering.

3. As an extra, people could rate restaurants on different topics (food, service, price,

wine, coziness, …). How would this influence our collaborative filtering?

4. And we haven’t even started programming all of this. Perhaps algorithms

could/should better be explained without interference of discussing

implementation? So I decided to try this out and stay away from coding on this one.

If we explain coding all of this, should we do it in parallel while explaining the

algorithm, or after we explained it?
