bi canetto

8/2/2019 BI Canetto

CLAMDA - INTERNATIONAL MANAGEMENT

FACULTY OF ECONOMICS

Coffee Drinking Habits in Lithuania, Palestine and Italy

Business Intelligence Written Assignment

Kotryna Garsvaite

Mahran Sharqawi

Marcello Canetto

Professor Furio Camillo

CONTENT

Introduction
1 The Research
2 Importing Data
3 Simple Statistics
4 Principal Component Analysis
5 Size Effects Removal
    5.1 Principal Component Analysis after Size Effect Removal
6 The Cluster Analysis
7 Ward's Method
    7.1 Dendrogram - Graphical Representation
8 T-TEST Procedure
    8.1 Preparation of the dataset
    8.2 T-Test for Respondents
    8.3 Cluster 1
    8.4 Cluster 2
    8.5 Cluster 3
9 Proc Freq Procedure (Chi Square Test)
    9.1 Cluster x Variable
    9.2 Country x Variable
10 Strategic Decisions
    10.1 Cluster 1 - Sophisticated Coffee and Cigarettes
    10.2 Cluster 2 - Fast Coffee
    10.3 Cluster 3 - Sweet Break or Take-Away
11 Appendix

Introduction

The energizing effect of the coffee bean plant is thought to have been discovered in the northeast region of Ethiopia, and the cultivation of coffee first expanded in the Arab world. The earliest credible evidence of coffee drinking appears in the middle of the 15th century, in the Sufi monasteries of Yemen in southern Arabia. From the Muslim world, coffee spread to Italy, then to the rest of Europe.

Coffee has been a popular drink for many centuries. Searching through the pages of history for the roots of this drink leads to many stories, some of them better described as legends. From the northeast region of Ethiopia, through the Arabian Peninsula, to Europe and later to South American countries such as Brazil, people have used coffee for the stimulating effect of its caffeine content.

Because of the popularity and appeal of coffee, and because we could distribute a questionnaire in different countries, we chose Coffee Drinking Habits in Lithuania, Palestine and Italy as the subject of our research. Our goal is to find the best cafe concept for different groups of people in three different countries.

1 The Research

In our research we examined the habits of drinking coffee outside the home in three countries: Italy, Lithuania and Palestine. To achieve this goal we formulated a questionnaire aimed at people in these three countries, which lie at different points on the map and have different climates and, of course, different cultures. Our purpose is to find the differences in coffee drinking habits between these three countries and, in addition, to find similarities between specific groups within them.

We are also aware of the wide range of respondents and of the effect other factors may have on their answers. Coffee is perceived differently in these countries; still, worldwide chains such as Starbucks may have a similar effect on consumers in different places.

Through our questionnaire we tried to collect information about coffee drinking habits, such as the type of coffee preferred, the times at which people prefer to drink coffee, and other factors that are important to people who drink their coffee outside.

The questionnaire was worded as clearly as possible so that everyone could understand the questions and answers without errors. Another feature is its simplicity: we structured the questions as simply as possible so that respondents could answer them in the shortest possible time.

The result was a questionnaire of 18 questions (Appendix 1).

First we placed four qualitative questions: favorite type of coffee, frequency of drinking coffee, how many times a day, and the way of drinking coffee outside.

Then we identified 9 factors that we consider important for understanding the reasons behind the respondents' decisions. We formulated 9 questions and asked respondents to rate each factor on a scale from 1 to 10, where 1 is the most negative (not important) evaluation and 10 the most positive (important).

These 9 factors are:

1. Interior and atmosphere of the place.
2. Socializing with people.
3. Effect of caffeine.
4. Traditional tastes of coffee.
5. Importance of the price.
6. Smoking.
7. "Take-away culture".
8. Ice coffee.
9. Dessert/croissant.

As our target audience was in three different countries and it was hard to reach them physically in order to hand out the questionnaire, we reached them through the internet. We published the questionnaire online for four days.

We placed the generic questions at the end of the questionnaire in order to identify our respondents.

In the end our sample consisted of 157 usable observations, around 40-60 from each country. Lithuanians were the most active respondents and Palestinians the least active.

The data were originally cataloged in Microsoft Excel and later imported into SAS software.

2 Importing Data

The first step was to import the data from Excel into SAS using the Import Wizard.
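The Import Wizard generates a PROC IMPORT step behind the scenes. A minimal sketch of an equivalent step follows; the library path, the workbook name and the sheet name are all hypothetical, since the report does not state them, and dbms=xlsx assumes a SAS installation that can read that format.

```sas
/* Hypothetical locations: the report does not give the real paths. */
libname Coffee "C:\coffee";

proc import datafile="C:\coffee\questionnaire.xlsx"  /* assumed file name */
            out=Coffee.Coffee
            dbms=xlsx
            replace;
    sheet="Sheet1";      /* assumed sheet name */
    getnames=yes;        /* first row holds the column names */
run;
```

Writing the step out instead of using the wizard makes the import repeatable if the Excel file is updated.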

Then we renamed and labeled the questions as follows:

ID id

1. Question m_1 (type of coffee) we gave it the label type
2. Question m_2 (time of drinking) we gave it the label time
3. Question m_3 (times drinking) we gave it the label times a day
4. Question m_4 (preferred drinking way) we gave it the label way to drink
5. Next we placed 9 sub questions which asked the respondents about the importance of different factors in their choice.
6. Question s_1 (interior and atmosphere) we gave it the label interior
7. Question s_2 (socializing with people) we gave it the label socializing
8. Question s_3 (effect of caffeine) we gave it the label caffeine
9. Question s_4 (traditional tastes) we gave it the label tastes
10. Question s_5 (importance of price) we gave it the label price
11. Question s_6 (smoking) we gave it the label smoking
12. Question s_7 (take-away culture) we gave it the label take away
13. Question s_8 (ice coffee) we gave it the label ice coffee
14. Question s_9 (dessert/croissant) we gave it the label dessert
15. Question Country we gave it the label country
16. Question Gender we gave it the label gender
17. Question Age we gave it the label age
18. Question 8 Occupation we gave it the label occupation
19. Question 9 Smoking we gave it the label smoker

To do so, we had the following commands in SAS:

data Coffee.Coffee;
    set Coffee.Coffee;
    label id='id'
          m_1='type'
          m_2='time'
          m_3='times a day'
          m_4='way to drink'
          s_1='interior'
          s_2='socializing'
          s_3='caffeine'
          s_4='tastes'
          s_5='price'
          s_6='smoking'
          s_7='take away'
          s_8='ice coffee'
          s_9='dessert'
          country='country'
          gender='gender'
          age='age'
          occupation='occupation'
          smoker='smoker';
run;

Our data was ready for the analysis of the values in the respective tables.

3 Simple Statistics

Once our data was sorted and renamed, we started with the first data analysis.

The first procedure we ran was PROC MEANS. To do this we gave SAS the command:

proc means data=Coffee.Coffee n mean stddev stderr median cv;
    var s_1-s_9;
run;

With this procedure we obtain the number of observations (n), the average (mean), the standard deviation (stddev), the standard error (stderr), the median (median) and the coefficient of variation (cv).

The MEANS Procedure

Variable  Label        N    Mean       Std Dev    Std Error  Median     Coeff of Variation
s_1       interior     157  6.8343949  2.4543189  0.1958760  8.0000000  35.9112829
s_2       socializing  157  6.3312102  2.4215606  0.1932616  7.0000000  38.2479888
s_3       caffeine     157  6.5796178  2.6339193  0.2102096  7.0000000  40.0314930
s_4       tastes       157  4.1528662  3.0237640  0.2413226  3.0000000  72.8114953
s_5       price        157  5.3630573  2.6485431  0.2113767  5.0000000  49.3849478
s_6       smoking      143  4.7482517  3.8409736  0.3211983  3.0000000  80.8923740
s_7       take away    157  6.1401274  2.8699562  0.2290474  6.0000000  46.7409882
s_8       ice coffee   157  5.5732484  3.1136668  0.2484977  6.0000000  55.8680781
s_9       dessert      157  5.3503185  2.7939179  0.2229789  5.0000000  52.2196568
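As a quick check of how the last columns relate to the others, the standard error and the coefficient of variation of s_1 (interior) can be recomputed from the table. This uses the usual definitions: standard error as the standard deviation over the square root of n, and cv as 100 times the standard deviation over the mean.

```latex
\[
\mathrm{SE}(s_1) = \frac{2.4543189}{\sqrt{157}} \approx 0.1959,
\qquad
\mathrm{CV}(s_1) = 100 \cdot \frac{2.4543189}{6.8343949} \approx 35.91
\]
```

Both values agree with the s_1 row of the PROC MEANS output.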

In this table we marked the highest means with the lowest standard deviations, which gives us a clear view of the important variables in our research. For example, the high mean of the Interior variable shows that respondents give this factor high importance when choosing a place to drink coffee. Respondents also gave high importance to Socializing and to the Take-Away option. This tells us that, for those who choose to drink coffee outside, three of the important factors are connected not to the coffee itself but to the habit of drinking coffee. Another important fact is that Caffeine has one of the four highest means, which tells us that some respondents associate coffee with its primary feature, caffeine.

4 Principal Component Analysis

In this part we try to find possible relationships between the variables. In other words, we want to see whether there is any relationship between the different answers to the questionnaire. This requires a multivariate analysis of the responses, which we carry out through principal component analysis.

In SAS we use the program:

proc princomp data=Coffee.Coffee;
    var s_1-s_9;
run;

This gives us 3 useful results to analyze:

1) Correlation coefficients

Correlation Matrix

s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8 s_9

s_1 interior 1.0000 0.5319 -.1563 0.2340 0.1566 -.0505 0.3069 0.0336 0.1214

s_2 socializing 0.5319 1.0000 -.0171 0.3303 0.1707 -.0478 0.3081 0.0174 0.1539

s_3 caffeine -.1563 -.0171 1.0000 0.0022 0.1269 -.0377 -.0592 0.0135 0.1890

s_4 tastes 0.2340 0.3303 0.0022 1.0000 0.1767 -.1145 0.3631 0.1942 0.2501

s_5 price 0.1566 0.1707 0.1269 0.1767 1.0000 0.1137 -.0102 0.0520 0.0889

s_6 smoking -.0505 -.0478 -.0377 -.1145 0.1137 1.0000 -.0909 -.0998 -.1030

s_7 take away 0.3069 0.3081 -.0592 0.3631 -.0102 -.0909 1.0000 0.3366 0.1401

s_8 ice coffee 0.0336 0.0174 0.0135 0.1942 0.0520 -.0998 0.3366 1.0000 0.2037

s_9 dessert 0.1214 0.1539 0.1890 0.2501 0.0889 -.1030 0.1401 0.2037 1.0000

The first observation about the correlation coefficients is that they mix positive and negative values: the majority are positive, and only 11 of the 36 distinct pairs are negative. We highlighted the highest values in yellow and the lowest in light blue.

Starting with the positive values, the highest is 0.5319, the correlation between Socializing and Interior/Atmosphere. We can conclude that respondents who gave importance to socializing with people while drinking coffee also gave great importance to the interior of the cafe, and vice versa.

Although only one correlation is above 0.5, we still consider values above 0.2 as high, as we are testing a wide range of respondents. In second place we find the correlation between Tastes and Take-Away, with a value of 0.3631. We assume that respondents who like to take their coffee away care about the different tastes of coffee.

In third place is the correlation between Ice Coffee and Take-Away, with a value of 0.3366. We could conclude that ice coffee lovers take it away. Moreover, Take-Away people, besides the different tastes mentioned before, also like Ice Coffee.

Another positive correlation is between Tastes and Dessert (0.2501): respondents who like different, probably sweet, tastes of coffee do not refuse dessert either.

The last high value we discuss in this table is the correlation between Tastes and Socializing (0.3303). It says that people like spending time with others in a coffee place that offers various tastes. Furthermore, Tastes has another positive correlation of 0.2340 with Interior.

The negative values, all small in magnitude, mark the features that do not go together (Caffeine and Socializing; Smoking with Tastes, Take-Away, Ice Coffee and Dessert).

These observations give us a first view of the trends in our research, which will be measured more precisely in later calculations.

The values are not all positive, which reduces the possibility of an error called size effect. Still, the data must be corrected to check for this error and its possible influence on the results obtained. To do this we will use a procedure shown later.

Now we continue with the second part of the PRINCOMP analysis.

2) Correlation matrix eigenvalues

We noted the presence of positive correlations between the variables, but we also found negative correlations; still, we decided to try to eliminate the size effect. The PRINCOMP procedure in SAS generates new vectors defining a new vector system composed of new, independent and uncorrelated dimensions. Each principal component is a linear combination of the original variables, with coefficients equal to an eigenvector of the correlation matrix.

Eigenvalues of the Correlation Matrix

Eigenvalue Difference Proportion Cumulative

1 2.30618151 0.98494347 0.2562 0.2562

2 1.32123803 0.09821232 0.1468 0.4030

3 1.22302572 0.23309129 0.1359 0.5389

4 0.98993443 0.19748232 0.1100 0.6489

5 0.79245210 0.05262728 0.0881 0.7370

6 0.73982482 0.03803349 0.0822 0.8192

7 0.70179132 0.20455790 0.0780 0.8972

8 0.49723342 0.06891478 0.0552 0.9524

9 0.42831865 0.0476 1.0000

The first column shows the eigenvalue of each principal component.

We consider the eigenvalues to determine the importance of the principal components. The first 3 eigenvalues are greater than one and are therefore the most significant. However, looking at the variance, we note that keeping only the first three would stop at about 54% of the variance explained.

The other components are less important, but each still explains roughly 5% or more of the variance. Our target is not specific, so in order not to lose information we consider all 9 principal components, which lets us explain the total variance.
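The Proportion and Cumulative columns follow directly from the eigenvalues: because the analysis is run on the correlation matrix of 9 variables, the eigenvalues sum to 9. As an illustration using the figures in the table:

```latex
\[
\text{Proportion}_1 = \frac{\lambda_1}{9} = \frac{2.30618151}{9} \approx 0.2562,
\qquad
\text{Cumulative}_3 = \frac{2.30618151 + 1.32123803 + 1.22302572}{9}
= \frac{4.85044526}{9} \approx 0.5389
\]
```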

3) Eigenvectors

Prin1 Prin2 Prin3 Prin4

s_1 interior 0.434038 -.417329 0.046529 -.169922

s_2 socializing 0.461075 -.313510 0.160018 -.240113

s_3 caffeine -.010706 0.517591 0.462081 -.245532

s_4 tastes 0.447888 0.103031 0.018111 0.051922

s_5 price 0.187179 -.016211 0.637148 0.248980

s_6 smoking -.129733 -.273119 0.374877 0.665444

s_7 take away 0.441400 0.033958 -.319502 0.240765

s_8 ice coffee 0.260091 0.421480 -.287120 0.516211

s_9 dessert 0.289749 0.442015 0.165447 -.14574

Prin5 Prin6 Prin7 Prin8 Prin9

s_1 0.079371 -.036478 0.389923 -.050380 0.666483

s_2 0.143574 0.169851 0.091209 0.437747 -.597040

s_3 0.180006 0.609017 0.062985 0.064716 0.216151

s_4 -.151813 -.070963 -.787577 0.244862 0.278280

s_5 -.606727 -.119484 0.125885 -.279837 -.142138

s_6 0.546455 0.030417 -.112881 0.092286 0.066560

s_7 0.141779 0.417523 -.065994 -.640667 -.186355

s_8 -.155812 0.007970 0.418987 0.450530 0.054111

s_9 0.454455 -.635840 0.083081 -.203562 -.113544

In the first column, Prin1, we count 7 variables with positive coefficients and 2 with negative coefficients. This shows that there is no strong size effect. However, before further considerations, we will remove the size effect to improve the results of our analysis.

5 Size Effects Removal

Size effects removal is justified by the fact that the values assigned to the factors of the questionnaire depend on the average level of one person's judgments. This level can vary greatly: we find very low values across all answers from people with a more pessimistic view, and higher values from those who are accustomed to giving high scores to every parameter (optimistic).

Size effects are removed through a standardization procedure in SAS. We start by creating 9 new variables (n_1-n_9), which hold the new values of the 9 scale questions. These values are centered on the average value of each individual. SAS calculates the maximum, the minimum and the average value for each individual and then standardizes the answers to a range between -1 and +1, where 0 represents the individual's average.

We gave SAS the following command:

data Coffee.Coffee_1;

set Coffee.Coffee;

if _n_
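A minimal sketch of one way to implement the centering described above follows. It is not the authors' code, and several things are assumptions: the output name work.coffee_centered is hypothetical, answers above the respondent's mean are scaled by (max - mean) and answers below it by (mean - min) so that the mean maps to 0 and the extremes to +1 and -1, and missing answers or respondents who gave the same score everywhere are set to 0.

```sas
data work.coffee_centered;              /* hypothetical output name */
    set Coffee.Coffee;
    array raw{9} s_1-s_9;               /* original 1-10 answers */
    array ctr{9} n_1-n_9;               /* centered answers in [-1, 1] */

    /* per-respondent summary of the 9 scale answers (missing values ignored) */
    row_mean = mean(of s_1-s_9);
    row_min  = min(of s_1-s_9);
    row_max  = max(of s_1-s_9);

    do k = 1 to 9;
        /* assumption: no information means "at the respondent's average" */
        if missing(raw{k}) or row_max = row_min then ctr{k} = 0;
        else if raw{k} >= row_mean then
            ctr{k} = (raw{k} - row_mean) / (row_max - row_mean);
        else
            ctr{k} = (raw{k} - row_mean) / (row_mean - row_min);
    end;

    drop k row_mean row_min row_max;
run;
```

Under these assumptions every respondent's highest answer becomes +1 and lowest becomes -1.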

Now, using our new dataset, we repeat the initial procedures to check the actual difference between the original dataset and the corrected one.

As before, we use the PROC MEANS procedure, so in our SAS program we write:

proc means data=Coffee.Coffee_1;
    var n_1-n_9;
run;

The result is:

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum

n_1 157 0.3196513 0.5870416 -1.0000000 1.0000000

n_2 157 0.1830072 0.5670946 -1.0000000 1.0000000

n_3 157 0.2677898 0.6738488 -1.0000000 1.0000000

n_4 157 -0.3539103 0.6861465 -1.0000000 1.0000000

n_5 157 -0.0736206 0.6409227 -1.0000000 1.0000000

n_6 157 -0.1591141 0.8576755 -1.0000000 1.0000000

n_7 157 0.1395236 0.6865145 -1.0000000 1.0000000

n_8 157 0.0133477 0.7477820 -1.0000000 1.0000000

n_9 157 -0.0531759 0.6824755 -1.0000000 1.0000000

First, we notice that after the standardization all values lie between -1 and +1.

The standardization does not show surprising effects. In fact, the three factors with the highest means did not change: Interior, Socializing and Caffeine. However, the factor Take-Away, although still in fourth place, loses its decisive role and becomes less important in the analysis.

Looking at the negative means (highlighted in blue), we report in order: Tastes, Smoking, Price and Dessert. The trend is again similar to the one seen before size effect removal.

5.1 Principal Component Analysis after Size Effect Removal

After size effects removal we can repeat the principal component procedure using the new, more precise dataset.

In our SAS program we write:

proc princomp data=Coffee.Coffee_1 out=Coffee.cluster;
    var n:;
run;

Correlation Matrix

n_1 n_2 n_3 n_4 n_5 n_6 n_7 n_8 n_9

n_1 1.0000 0.3349 -.1629 -.0016 0.0090 -.1434 0.1069 -.1875 -.0245

n_2 0.3349 1.0000 -.1091 0.1039 0.0056 -.2060 0.0633 -.2269 -.0505

n_3 -.1629 -.1091 1.0000 -.2004 0.0376 -.1189 -.2219 -.0620 0.0918

n_4 -.0016 0.1039 -.2004 1.0000 -.0055 -.2079 0.1704 0.0148 0.1436

n_5 0.0090 0.0056 0.0376 -.0055 1.0000 0.0012 -.1916 -.1977 -.0554

n_6 -.1434 -.2060 -.1189 -.2079 0.0012 1.0000 -.2312 -.1898 -.1804

n_7 0.1069 0.0633 -.2219 0.1704 -.1916 -.2312 1.0000 0.1870 -.0672

n_8 -.1875 -.2269 -.0620 0.0148 -.1977 -.1898 0.1870 1.0000 0.0374

n_9 -.0245 -.0505 0.0918 0.1436 -.0554 -.1804 -.0672 0.0374 1.0000

The outcome of this correlation matrix shows some differences with respect to the previous one, computed without size effects removal.

The highest correlation (0.3349) is between Socializing and Interior. The second highest (0.1870) links Ice Coffee with Take-Away. The third (0.1704) links Take-Away with different Tastes.

Analyzing the correlations, we can argue that the most correlated factors are similar or logically connected: if a person links coffee with socializing, he will choose a café with a nice interior to spend his time with another person.

Further, ice coffee is a long drink that takes time to finish, so people can take it away to drink it slowly.

Furthermore, the negative correlations let us draw some important conclusions. Firstly, the strongest negative correlation (-0.2312) shows that people do not link Smoking with Take-Away. The negative correlation of -0.2269 likewise shows that people do not relate Ice Coffee to Socializing.

Eigenvalues of the Correlation Matrix

Eigenvalue Difference Proportion Cumulative

1 1.73942362 0.19850007 0.1933 0.1933

2 1.54092355 0.26656185 0.1712 0.3645

3 1.27436170 0.23053503 0.1416 0.5061

4 1.04382667 0.11817241 0.1160 0.6221

5 0.92565426 0.15262272 0.1029 0.7249

6 0.77303154 0.09324188 0.0859 0.8108

7 0.67978966 0.09613972 0.0755 0.8863

8 0.58364994 0.14431088 0.0648 0.9512

9 0.43933906 0.0488 1.0000

The new correlation matrix has 4 principal components with eigenvalues greater than one, which we would want to consider. However, they explain only 62% of the information, which would lead us to consider at least 5 components.

After some consideration we still decided to keep all 9 components for our analysis, because our sample is neither homogeneous nor specific: it is composed of people from three different countries and consequently diverse cultures.

We keep all the principal components because we do not want to lose detail in the data. We want a macro view of the customer, as the scope of this research is to make general macro marketing and strategic decisions about opening cafés in different countries.

Eigenvectors

Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9

n_1 0.371810 0.427576 -.021365 -.273788 -.143199 0.477643 0.150053 -.541602 0.197539

n_2 0.394858 0.444974 0.100454 -.181705 -.054699 -.283115 -.471510 0.454429 0.301668

n_3 -.324870 -.052409 0.533502 -.428624 0.124845 -.367410 0.230515 -.208916 0.419537

n_4 0.389770 -.104202 0.143501 0.644117 -.002649 -.407235 -.032695 -.437329 0.208731

n_5 -.165954 0.338640 0.207353 0.394213 0.673584 0.353772 0.084423 0.177793 0.204390

n_6 -.429974 0.152762 -.489183 0.222761 -.358620 -.021224 0.084132 0.051592 0.603378

n_7 0.469938 -.245918 -.263986 -.124679 0.197221 -.081206 0.649893 0.339844 0.223837

n_8 0.096979 -.607624 -.094785 -.145799 0.226310 0.331441 -.494236 -.086094 0.422368

n_9 0.075611 -.194617 0.568567 0.227133 -.537096 0.385772 0.141847 0.328755 0.126714

Principal component analysis is a method for reducing the dimensionality of data. When two or more variables explain the same phenomenon, the purpose is to summarize the information they share. So wherever we have correlated variables we introduce principal components.

The statistical space R^p, where p is the number of variables, will be reduced to a space R^x with x

As we have seen, we have 9 theoretical PCs. As the order of extraction increases, the variance of the components decreases, which indicates that they lose importance (significance). This happens because of the way the PCs are built: the first receives the maximum variance, and each following one receives less. However, each PC brings new informative content. We decided to use all the principal components in our analysis because each has a weight of about 5% or more and, as already mentioned, our sample is not homogeneous, so all the collected information is useful for our research.

6 The Cluster Analysis

The next step is to divide the sample into clusters, also called segments. The purpose of this analysis is to find groups of people who are sufficiently similar with respect to one or more variables (in this case the segmentation is based on the scale questions of our questionnaire).

These groups should show a very small variance within them (homogeneity of the cluster) and, at the same time, a significant variance between them (the clusters must be as diverse as possible so that we can give them a meaning).

In this way we should create homogeneous groups of potential customers that are easy to find and study, for example for marketing purposes.

We have to point out that the number of clusters is influenced by the purpose of the research. For general marketing and strategic choices the number of clusters should be low: we are trying to find the best concepts for cafes, and too many groups would be too varied to implement. Conversely, for micro marketing and operational decisions the number of clusters should be very high to be effective.

The first step of the cluster analysis is the creation of a distance matrix in which the observations are both the rows and the columns, while each cell holds the measure of similarity or distance for a pair of observations. The distance matrix thus relates the observations to one another in an N x N matrix.

Example of distance-matrix

a b c d e

a 0 16 1 9 10

b 16 0 17 25 2

c 1 17 0 4 9

d 9 25 4 0 13

e 10 2 9 13 0

There are various ways of measuring the distance between observations, the distance between clusters and their similarity. These measures are then used as the criteria for deciding whether or not to merge different clusters: the distance indicates when observations can be considered different. In hierarchical clustering the initial distances between observations arranged in the matrix are progressively reduced by merging the observations with the shortest distance (minimum distance fusion).

Under this method, in our example, customers a and c are the most similar (distance equal to 1). So customers a and c merge into a new unit f. The distance from this new unit to each other unit is the minimum of the distances from the two old units.

Following the example we have:

f b d e

f 0 16 4 9

b 16 0 25 2

d 4 25 0 13

e 9 2 13 0

Now the most similar customers are b and e, so they merge into a new unit g, and so on until only a single cluster remains. This theory leads to Ward's method.
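The merging steps of the example can be reproduced in SAS by passing the distance matrix directly to PROC CLUSTER. The following is a small sketch, assuming hypothetical dataset names (work.dist, work.tree_example) and using method=single, which corresponds to the minimum distance fusion described above; it is separate from the Ward's method analysis that the report actually runs.

```sas
/* The example distance matrix, stored as a TYPE=DISTANCE dataset */
data work.dist(type=distance);
    input id $ a b c d e;
    datalines;
a 0 16 1 9 10
b 16 0 17 25 2
c 1 17 0 4 9
d 9 25 4 0 13
e 10 2 9 13 0
;
run;

/* Single linkage merges, at each step, the two units with the smallest distance */
proc cluster data=work.dist method=single outtree=work.tree_example;
    var a b c d e;
    id id;
run;
```

The cluster history printed by the procedure lists the merges in order, starting with a and c and then b and e.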

7 Ward's Method

In this analysis we use Ward's method, which creates groups by merging observations when the distance between the two is minimal. This distance is calculated as the sum of squared Euclidean distances, which are obtained using the Pythagorean theorem.

Sum of squared errors = SSE = $\sum_i \sum_j \sum_k (X_{i,jk} - Y_{i,jk})^2$

This method therefore tends to maximize the so-called between variance (among the different clusters) and to minimize the within variance (inside each cluster):

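$$\mathrm{Var}(X) = \mathrm{Var}_{within} + \mathrm{Var}_{between}$$

$$\mathrm{Var}_{within} = \frac{\sum_i (X_{ip} - \bar{X}_p)^2}{N_p}$$

that is, the sum of the squared differences between the value of the i-th unit and the average value of its group, divided by the size of the group;

$$\mathrm{Var}_{between} = \frac{\sum_p (\bar{X}_p - \bar{X})^2 N_p}{N_{tot}}$$

that is, the sum of the squared differences between the average value of each group (cluster) and the total average value, each multiplied by the size of the group, divided by the size of the whole population.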
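A small worked example, with made-up numbers that are not taken from the survey, shows how the decomposition works. Take four values split into two clusters, A = {1, 3} and B = {7, 9}, and weight each cluster's within variance by its share of the observations (an assumption needed for the two parts to add up to the total):

```latex
\[
\bar{X} = 5, \qquad \bar{X}_A = 2, \qquad \bar{X}_B = 8
\]
\[
\mathrm{Var}(X) = \frac{(1-5)^2 + (3-5)^2 + (7-5)^2 + (9-5)^2}{4} = \frac{40}{4} = 10
\]
\[
\mathrm{Var}_{within} = \frac{2 \cdot 1 + 2 \cdot 1}{4} = 1,
\qquad
\mathrm{Var}_{between} = \frac{(2-5)^2 \cdot 2 + (8-5)^2 \cdot 2}{4} = \frac{36}{4} = 9
\]
\[
\mathrm{Var}_{within} + \mathrm{Var}_{between} = 1 + 9 = 10 = \mathrm{Var}(X)
\]
```

Almost all of the variance lies between the clusters, which is the situation a good partition aims for.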

Below is the code we used in SAS to create the dataset Coffee.tree.

proc cluster data=Coffee.cluster method=ward outtree=Coffee.tree;
    var Prin1-Prin9;
    id id;
run;

We then used this dataset to draw the dendrogram, which helped us to find the number of significant clusters for our analysis.

7.1 Dendrogram - Graphical Representation

In a clustering procedure, the dendrogram provides a graphical representation of the process of grouping observations. It shows the relative distance at which the statistical units are merged. The graph is drawn in a Cartesian plane, with axis X showing the logical distance between clusters according to the chosen measure and axis Y the hierarchical level of aggregation, or fusion distance.

The choice of the hierarchical level defines the number of clusters adopted for the analysis. The observations are aggregated according to their degree of proximity: the further apart two observations are on axis X, the lower the possibility that they are aggregated into the same cluster.

However, this possibility is also a function of the fusion distance accepted, since the higher the level of the hierarchy chosen, the lower the number of clusters found.

In our research we use the dendrogram to identify the relevant number of clusters to examine.

We can now create the dendrogram through SAS. After trying a few different numbers of clusters, we settled on 3. The result of the procedure is the graph below, which was cut to identify these 3 clusters. The cut was made at r = 0.06, indicating a fairly good level of accuracy. The dendrogram could be cut lower, giving a higher number of clusters, but as mentioned before the lower number serves our strategic decisions better.

The SAS procedure used:

proc tree data=Coffee.tree nclusters=3 out=Coffee.parti3;
    id id;
run;
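Once PROC TREE has written its output dataset, a frequency table is a quick way to see how many respondents fall into each of the 3 clusters. This is a small sketch that relies on the CLUSTER variable PROC TREE adds to the out= dataset; the report itself does not show this step.

```sas
/* Count the respondents assigned to each cluster */
proc freq data=Coffee.parti3;
    tables cluster;
run;
```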

[Figure: dendrogram of the respondents (id) against Semi-Partial R-Squared, scale 0.000 to 0.100]

SAS Dendrogram. 3 Clusters

8 T-TEST Procedure

8.1 Preparation of the dataset

Having identified the clusters, we now want to understand the distinctive characteristics of each group found.

To do this we created a generic cluster in which we included all the answers received. This allows us to compare the mean scores of each of the three clusters with the overall mean and to understand how, and on which variables, each cluster differs from the sample as a whole.

8.2 T-Test for Respondents

To study the relationship between the quantitative variables and the segments obtained with PROC CLUSTER, we can use the t-test.

For each variable, the test computes a value t and the associated probability that the mean calculated for the cluster differs from the mean calculated on the entire sample only by chance. The smaller this probability, shown in SAS as Pr > |t|, the less likely it is that the difference between the means is due to random effects, and the more likely it is that the variable is significant in describing that cluster.

The significance level usually adopted in the literature is 0.05. The null hypothesis of equal means is therefore rejected, at the 95% confidence level, if the probability is less than 0.05.

The t-value reported in the output is calculated as follows: t = (c - tot) / sc, where c is the mean of the cluster, tot is the mean of the total sample, and sc is the standard error, an estimate of the variability of the difference between the means.

A few words should be said about the variance. Since the variance of the population is not available, an estimator of the variance must be used. SAS calculates the t-test using two methods, which differ only in the treatment of the variances used in the standard error of the t-test.

The Satterthwaite method calculates the standard error from the two variances (of the cluster and of the total sample), each divided by its own sample size. This method does not assume equality of variances and can be applied in all circumstances. The Pooled method instead obtains the standard error from a single pooled estimate of the two variances, and so requires equal variances: it can only be applied when the F-test for equality of variances does not reject the null hypothesis. Since the two methods produce very similar values of t when the variances are equal, it seems more efficient to use the Satterthwaite method throughout.
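The two standard errors can be compared with a minimal sketch, written in Python rather than SAS. The formulas are the textbook unequal-variance (Satterthwaite) and pooled versions of the two-sample t-test; the function names and the scores are hypothetical and are not taken from the survey.

```python
from math import sqrt
from statistics import mean, variance


def t_satterthwaite(x, y):
    """t statistic and approximate degrees of freedom, unequal variances."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def t_pooled(x, y):
    """t statistic and degrees of freedom, assuming equal variances."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2


# Hypothetical 1-10 scores, not the survey data.
cluster_scores = [8, 9, 7, 10, 8, 9]
all_scores = [8, 9, 7, 10, 8, 9, 3, 5, 6, 4, 7, 2]
t_s, df_s = t_satterthwaite(cluster_scores, all_scores)
t_p, df_p = t_pooled(cluster_scores, all_scores)
print(round(t_s, 2), round(df_s, 1), round(t_p, 2), df_p)
```

The Satterthwaite degrees of freedom are generally not an integer, which is why the DF column of the SAS output shows values such as 131.32.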

Before further calculations, the data must be sorted and merged, and a general cluster 4 containing all respondents must be created, so that the data can be compared in the t-test:

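proc sort data=Coffee.parti3;
   by id;
run;

/* attach the cluster assignments from PROC TREE to Coffee.cluster by id */
data Coffee.compare;
   merge Coffee.cluster Coffee.parti3;
   by id;
run;

/* copy of every respondent, relabelled as general cluster 4 */
data Coffee.compare1;
   set Coffee.compare;
   cluster=4;
run;

/* stack the original clusters and the general cluster */
data Coffee.compare3;
   set Coffee.compare Coffee.compare1;
run;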
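The effect of the last two data steps can be sketched in Python (not SAS) as follows. The function name `add_general_cluster` and the three respondents are hypothetical; only the idea of appending a relabelled copy of every row comes from the steps above.

```python
def add_general_cluster(rows, general_label=4):
    """Return rows plus a copy of every row relabelled as the general cluster."""
    copies = [{**row, "cluster": general_label} for row in rows]
    return rows + copies


# Hypothetical respondents (id, cluster, one score); not the survey data.
respondents = [
    {"id": 1, "cluster": 1, "n_1": 8},
    {"id": 2, "cluster": 2, "n_1": 4},
    {"id": 3, "cluster": 3, "n_1": 6},
]
stacked = add_general_cluster(respondents)
general = [row for row in stacked if row["cluster"] == 4]
print(len(stacked), len(general))
```

Because every respondent appears twice in the stacked dataset, the frequency tables in section 9 total 314 observations, with 157 of them in cluster 4.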

It is now possible to check, for each cluster, which variables can be used to describe its specific features, and to describe the direction and strength of each relationship, if any.

8.3 Cluster 1

proc ttest data=Coffee.compare3;
   var n_1-n_9;
   class cluster;
   where cluster=1 or cluster=4;
run;

8.3.1 Features:

Variable  Method         Variances  DF      t Value  Pr > |t|
1         Satterthwaite  Unequal    131.32  3.24     0.0015    interior
2         Satterthwaite  Unequal    108.6   1.60     0.1124    socializing
3         Satterthwaite  Unequal    90.227  -2.45    0.0161    caffeine
4         Satterthwaite  Unequal    97.436  -0.96    0.3406
5         Satterthwaite  Unequal    108.28  -0.71    0.4784
6         Satterthwaite  Unequal    123.69  5.53

This leads us to the conclusion that, if a café is opened, respondents in Cluster 1 place high value on the possibility to smoke, on the interior of the café and on socializing while drinking coffee. On the other hand, their decision about where to drink coffee does not depend on whether Ice Coffee is offered, whether the coffee is strong or not, or whether take-away is available.


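proc ttest data=Coffee.compare3;
   var n_1-n_9;
   class cluster;
   where cluster=2 or cluster=4;
run;

Variable  Method         Variances  DF      t Value  Pr > |t|
1         Satterthwaite  Unequal    69.053  -3.96    0.0002    interior
2         Satterthwaite  Unequal    85.885  -5.26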


proc ttest data=Coffee.compare3;
   var n_1-n_9;
   class cluster;
   where cluster=3 or cluster=4;
run;

Variable  Method         Variances  DF      t Value  Pr > |t|
1         Satterthwaite  Unequal    108.85  1.19     0.2367
2         Satterthwaite  Unequal    107.46  2.96     0.0037    socializing
3         Satterthwaite  Unequal    104.54  1.02     0.3121
4         Satterthwaite  Unequal    94.269  2.99     0.0036    tastes
5         Satterthwaite  Unequal    89.599  1.77     0.0802    price
6         Satterthwaite  Unequal    199.78  -10.11

|t|).

Analyzing the t-values, we find that all the variables except Smoking had a positive influence on the choice of the members of this group, which means that they care about the named features when drinking coffee, while Smoking is not important here.

We can see from the results that the variables Tastes, Socializing and Take-Away have the biggest effect and Dessert a slightly lower one. The least influential of these five variables was Price. It is clear that we are dealing with people who like coffee in different tastes, take it with dessert and associate it with socializing with other people.

9 Proc Freq Procedure (Chi Square Test)

After determining the three clusters and their particular characteristics, it is important to compare each cluster with the qualitative characteristics in order to learn more about it. Moreover, we can compare two qualitative variables with each other to understand our sample better. For these calculations we use the PROC FREQ procedure and the chi-square (CHISQ) test.

CHISQ provides chi-square tests of independence for each stratum and computes measures of association. The chi-square test is used here to compare one grouping variable (cluster) with a variable that takes two or more values (sex, country, age, etc.). The observed counts of observations in each category are compared with the expected counts, which are calculated under a theoretical expectation.

The null hypothesis is that the variables (cluster and country, age, etc.) are independent of each other; the alternative hypothesis is that the variables are not independent, that is, they are associated. At the level of single frequencies, the statistical null hypothesis is that the number of observations in each category is equal to that predicted, and the alternative hypothesis is that the observed numbers differ from the expected ones. The test lets us retain or reject the main hypothesis that cluster and the chosen variable are independent and, furthermore, compare the frequencies of each group in each cluster.

The test statistic is calculated by taking an observed number (O), subtracting the expected number (E), and then squaring this difference. The larger the deviation from the null hypothesis, the larger the difference between observed and expected. Squaring the differences makes them all positive. Each squared difference is divided by the expected number, and these standardized differences are summed.

The shape of the chi-square distribution depends on the number of degrees of freedom. For an extrinsic null hypothesis with a single nominal variable, the number of degrees of freedom is simply the number of values of the variable, minus one. In a test with more than one nominal variable, the number of degrees of freedom is equal to (number of rows - 1) x (number of columns - 1); in our case of a 4x3 table, there are (4-1)(3-1)=6 degrees of freedom.
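The calculation described above can be reproduced with a short sketch in Python (not SAS). The function name `chi_square` is hypothetical; the counts are those of the cluster by gender table reported in section 9.1.2, for which SAS gives a chi-square value of 0.2550 with 3 degrees of freedom.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom, expected counts) for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Cluster (rows 1-4) by gender (Female, Male) counts from section 9.1.2.
cluster_by_gender = [[32, 25], [25, 21], [32, 22], [89, 68]]
statistic, df, expected = chi_square(cluster_by_gender)
print(round(statistic, 4), df, round(expected[0][0], 3))
```

Row 4 (the general cluster) contributes nothing to the statistic, because its observed counts equal its expected counts by construction.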

In practice, the main hypotheses for evaluating each variable against cluster are:

H0: the variables are independent;

H1: the variables are not independent.
We say that the variables are independent, and we retain hypothesis H0, when the chi-square probability p > 0.05 (we choose the 95% confidence level as a default); otherwise we reject H0 and accept H1.

9.1 Cluster x Variable

In this part of the analysis we compare each cluster with all the qualitative characteristics, first those from the generic questions, then questions m_1-m_4. The only variable we do not compare with cluster is occupation, as it does not give much information, given that the majority of respondents are students.

9.1.1 Cluster x Country

The biggest difference among the people who participated in the survey is their country, as they have different cultures and habits. First we compare each cluster with the country where the respondents live.

The SAS program we use is:

proc freq data=Coffee.compare3;
   table cluster*country / all expected;
run;

Cluster  Statistic  Italy   Lithuania  Palestine  Total
1        Frequency  15      24         18         57
         Expected   18.516  22.873     15.611
         Percent    4.78    7.64       5.73       18.15
         Row Pct    26.32   42.11      31.58
         Col Pct    14.71   19.05      20.93
2        Frequency  23      13         10         46
         Expected   14.943  18.459     12.599
         Percent    7.32    4.14       3.18       14.65
         Row Pct    50.00   28.26      21.74
         Col Pct    22.55   10.32      11.63
3        Frequency  13      26         15         54
         Expected   17.541  21.669     14.79
         Percent    4.14    8.28       4.78       17.20
         Row Pct    24.07   48.15      27.78
         Col Pct    12.75   20.63      17.44
4        Frequency  51      63         43         157
         Expected   51      63         43
         Percent    16.24   20.06      13.69      50.00
         Row Pct    32.48   40.13      27.39
         Col Pct    50.00   50.00      50.00
Total    Frequency  102     126        86         314
         Percent    32.48   40.13      27.39      100

Statistics for Table of CLUSTER by country

Statistic DF Value Prob

Chi-Square 6 9.6280 0.1412

Likelihood Ratio Chi-Square 6 9.3295 0.1559

Mantel-Haenszel Chi-Square 1 0.0052 0.9428

Phi Coefficient 0.1751

Contingency Coefficient 0.1725

Cramer's V 0.1238

We see that the chi-square probability is 0.1412 > 0.05 (chi-square value 9.628, with 6 degrees of freedom). We retain the null hypothesis H0 and state that cluster and the variable country are independent.
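The reported probability and Cramer's V can be checked by hand with a short Python sketch (not SAS). It uses the closed-form upper tail of the chi-square distribution, which holds only for an even number of degrees of freedom, and the usual definition of Cramer's V; the function names are hypothetical.

```python
from math import exp, factorial, sqrt


def chi2_sf_even_df(x, df):
    """P(X >= x) for a chi-square variable; closed form valid for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form used here needs a positive even df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V from a chi-square value and the table dimensions."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Values reported for the cluster by country table (4 rows, 3 columns, N = 314).
p_value = chi2_sf_even_df(9.6280, 6)
v = cramers_v(9.6280, 314, 4, 3)
print(round(p_value, 4), round(v, 4))
```

Both results agree with the SAS output above to the four decimals shown there.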

On the other hand, analyzing the cells one by one, the probability of 14% is not very high and the chi-square value, above 9, is not extremely low, so there might be some relationship. We can find some differences in each cluster between the expected value and the observed frequency:

Cluster 2 has a frequency of 23 Italians instead of the expected 15, which is 50% instead of 32%; this means that the second cluster has features more common to Italians. Moreover, the same cluster has a lower than expected frequency of Lithuanians, 28% instead of 40%, so these cluster characteristics are less common among Lithuanians. Palestinians show no notable difference between expected and observed frequency.

Cluster 3 does not differ much from the expected values. It has a slightly lower frequency of Italians than expected, 24% instead of 32%, and slightly more Lithuanians than expected, 48% instead of 40%.

All in all, Cluster 1 is common to all three countries. Cluster 2 is more suited to Italians, and Palestinians do not reject it either. Cluster 3 reflects more Lithuanian habits; Palestinians do not reject it, while Italian preferences differ slightly here.
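One way to make this cell-by-cell reading more systematic is the Pearson residual, (O - E) / sqrt(E), where values beyond roughly +2 or -2 are commonly read as notable. The Python sketch below is not part of the original SAS analysis; it only applies that standard formula to the cluster by country counts.

```python
from math import sqrt

countries = ["Italy", "Lithuania", "Palestine"]
# Clusters 1-3 by country, plus the general cluster 4 (counts from 9.1.1).
observed = [[15, 24, 18], [23, 13, 10], [13, 26, 15], [51, 63, 43]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson residual of each cell: (observed - expected) / sqrt(expected).
residuals = [
    [(o - r * c / n) / sqrt(r * c / n) for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]

for cluster, row in enumerate(residuals, start=1):
    print(cluster, [f"{name}: {value:+.2f}" for name, value in zip(countries, row)])
```

On these counts the Italians in Cluster 2 are the only cell with a residual above 2, which matches the reading given in the text.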

9.1.2 Cluster x Gender

The second characteristic by which we compare the clusters is gender (sex). Here we use the SAS program:

proc freq data=Coffee.compare3;
   table cluster*sex / all expected;
run;

Cluster  Statistic  Female  Male    Total
1        Frequency  32      25      57
         Expected   32.312  24.688
         Percent    10.19   7.96    18.15
         Row Pct    56.14   43.86
         Col Pct    17.98   18.38
2        Frequency  25      21      46
         Expected   26.076  19.924
         Percent    7.96    6.69    14.65
         Row Pct    54.35   45.65
         Col Pct    14.04   15.44
3        Frequency  32      22      54
         Expected   30.611  23.389
         Percent    10.19   7.01    17.20
         Row Pct    59.26   40.74
         Col Pct    17.98   16.18
4        Frequency  89      68      157
         Expected   89      68
         Percent    28.34   21.66   50.00
         Row Pct    56.69   43.31
         Col Pct    50.00   50.00
Total    Frequency  178     136     314
         Percent    56.69   43.31   100.00

Statistics for Table of CLUSTER by sex


Chi-Square 3 0.2550 0.9683

Likelihood Ratio Chi-Square 3 0.2553 0.9682

Mantel-Haenszel Chi-Square 1 0.0272 0.8689



Cramer's V 0.0285

These results indicate that there is no statistically significant relationship between cluster and gender (chi-square with 3 degrees of freedom = 0.2550, p = 0.9683). The p-value is close to 1 and the chi-square value is very low, so we see clearly that there is no association between cluster and gender: the features of all three clusters are acceptable for both genders.

9.1.3 Cluster x Age

Another variable to check is age, computed with the program:

proc freq data=Coffee.compare3;
   table cluster*age / all expected;
run;

Cluster  Statistic  14-19   20-24   25-29   30-39   =>40    Total
1        Frequency  1       26      21      5       4       57
         Expected   1.0892  31.223  18.153  3.9936  2.5414
         Percent    0.32    8.28    6.69    1.59    1.27    18.15
         Row Pct    1.75    45.61   36.84   8.77    7.02
         Col Pct    16.67   15.12   21.00   22.73   28.57
2        Frequency  1       30      10      3       2       46
         Expected   0.879   25.197  14.65   3.2229  2.051
         Percent    0.32    9.55    3.18    0.96    0.64    14.65
         Row Pct    2.17    65.22   21.74   6.52    4.35
         Col Pct    16.67   17.44   10.00   13.64   14.29
3        Frequency  1       30      19      3       1       54
         Expected   1.0318  29.58   17.197  3.7834  2.4076
         Percent    0.32    9.55    6.05    0.96    0.32    17.20
         Row Pct    1.85    55.56   35.19   5.56    1.85
         Col Pct    16.67   17.44   19.00   13.64   7.14
4        Frequency  3       86      50      11      7       157
         Expected   3       86      50      11      7
         Percent    0.96    27.39   15.92   3.50    2.23    50.00
         Row Pct    1.91    54.78   31.85   7.01    4.46
         Col Pct    50.00   50.00   50.00   50.00   50.00
Total    Frequency  6       172     100     22      14      314
         Percent    1.91    54.78   31.85   7.01    4.46    100.00

Statistics for Table of CLUSTER by age


Chi-Square 12 6.0238 0.9149




Cramer's V 0.0800

WARNING: 50% of the cells have expected counts less

than 5. Chi-Square may not be a valid test.

These results show that there is no statistically significant relationship between cluster and age (chi-square with 12 degrees of freedom = 6.0238, p = 0.9149).

On the other hand, SAS warns that 50% of the cells have expected counts less than 5 and that chi-square may not be a valid test. For further calculations Fisher's exact test should be used. Fisher's exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of less than five. The chi-square test assumes that each cell has an expected frequency of five or more, whereas Fisher's exact test has no such assumption and can be used regardless of how small the expected frequencies are.
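The share of cells behind the SAS warning can be recomputed with a few lines of Python (not SAS). The function name `share_of_small_expected` is hypothetical; the counts are those of the cluster by age table in section 9.1.3.

```python
def share_of_small_expected(observed, threshold=5):
    """Fraction of cells whose expected count is below the threshold."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    return sum(e < threshold for e in expected) / len(expected)


# Cluster (rows 1-4) by age group counts from section 9.1.3.
cluster_by_age = [
    [1, 26, 21, 5, 4],
    [1, 30, 10, 3, 2],
    [1, 30, 19, 3, 1],
    [3, 86, 50, 11, 7],
]
share = share_of_small_expected(cluster_by_age)
print(share)
```

The small cells are concentrated in the 14-19, 30-39 and =>40 columns, where few respondents fall in any cluster.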

We could use the program as follows:

proc freq data=Coffee.compare3;
   tables cluster*age / fisher;
run;

On the other hand, we clearly see that the majority of our respondents are 20-29 years old (87%), so we focus on young people overall and further calculations are not necessary.

9.1.4 Cluster x Smoker

When opening a café it is important to know how the respondents relate to smoking, in order to decide whether to prepare places for smokers or not to invest in them. First we determine the frequencies of smokers in each cluster.

proc freq data=Coffee.compare3;
   table cluster*smoker / all expected;
run;

Cluster  Statistic  No      Yes     Total
1        Frequency  19      38      57
         Expected   33.764  23.236
         Percent    6.05    12.10   18.15
         Row Pct    33.33   66.67
         Col Pct    10.22   29.69
2        Frequency  25      21      46
         Expected   27.248  18.752
         Percent    7.96    6.69    14.65
         Row Pct    54.35   45.65
         Col Pct    13.44   16.41
3        Frequency  49      5       54
         Expected   31.987  22.013
         Percent    15.61   1.59    17.20
         Row Pct    90.74   9.26
         Col Pct    26.34   3.91
4        Frequency  93      64      157
         Expected   93      64
         Percent    29.62   20.38   50.00
         Row Pct    59.24   40.76
         Col Pct    50.00   50.00
Total    Frequency  186     128     314
         Percent    59.24   40.76   100.00

Statistics for Table of CLUSTER by smoker


Chi-Square 3 38.4895

This time the results show that there is a statistically significant relationship between cluster and smoking habits (chi-square with 3 degrees of freedom = 38.49, p =

9.1.5 Cluster x The Time of the Day

In this part we check whether there is any link between the time of the day for drinking coffee (m_1) and cluster. In this way, for example, the opening hours of the café could be optimized.

We use the program:

proc freq data=Coffee.compare3;
   table cluster*m_1 / all expected;
run;

Cluster  Statistic  After   Afternoon  Does not    Evening  Morning  Usually when I     Total
                    meals              matter the                    meet other people
                                       time                          for this purpose
1        Frequency  7       1          23          0        18       8                  57
         Expected   6.172   2.9045     20.694      0.3631   20.331   6.535
         Percent    2.23    0.32       7.32        0.00     5.73     2.55               18.15
         Row Pct    12.28   1.75       40.35       0.00     31.58    14.04
         Col Pct    20.59   6.25       20.18       0.00     16.07    22.22
2        Frequency  7       2          14          1        19       3                  46
         Expected   4.9809  2.3439     16.701      0.293    16.408   5.2739
         Percent    2.23    0.64       4.46        0.32     6.05     0.96               14.65
         Row Pct    15.22   4.35       30.43       2.17     41.30    6.52
         Col Pct    20.59   12.50      12.28       50.00    16.96    8.33
3        Frequency  3       5          20          0        19       7                  54
         Expected   5.8471  2.7516     19.605      0.3439   19.261   6.1911
         Percent    0.96    1.59       6.37        0.00     6.05     2.23               17.20
         Row Pct    5.56    9.26       37.04       0.00     35.19    12.96
         Col Pct    8.82    31.25      17.54       0.00     16.96    19.44
4        Frequency  17      8          57          1        56       18                 157
         Expected   17      8          57          1        56       18
         Percent    5.41    2.55       18.15       0.32     17.83    5.73               50.00
         Row Pct    10.83   5.10       36.31       0.64     35.67    11.46
         Col Pct    50.00   50.00      50.00       50.00    50.00    50.00
Total    Frequency  34      16         114         2        112      36                 314
         Percent    10.83   5.10       36.31       0.64     35.67    11.46              100.00

Statistics for Table of CLUSTER by m_1

Statistic DF Value Prob

Chi-Square 15 10.6619 0.7762

Phi Coefficient 0.1843


Cramer's V 0.1064



These results show that there is no statistically significant relationship between cluster and coffee drinking time (chi-square with 15 degrees of freedom = 10.66, p = 0.7762). However, SAS warns that 33% of the cells have expected counts less than 5 and that chi-square may not be a valid test. For further calculations Fisher's exact test should be used.

We also notice that most respondents drink coffee in the morning (36%), say that the time is not important (36%), or do it when they meet other people (11%). Just one respondent marked evening, so overall we can say that respondents are used to drinking coffee at all times of day except the evening. All the clusters show a similar trend, so further calculations are not necessary for our conclusion.

9.1.6 Cluster x The Type of Coffee

Another variable to analyze is the type of coffee (m_2) preferred by each cluster.

proc freq data=Coffee.compare3;
   table cluster*m_2 / all expected;
run;

Cluster  Statistic  Americano  Caffee  Cappuccino  Espresso  Other, not   Total
                               Latte                         traditional
                                                             types
1        Frequency  1          13      8           32        3            57
         Expected   2.5414     13.796  11.981      25.777    2.9045
         Percent    0.32       4.14    2.55        10.19     0.96         18.15
         Row Pct    1.75       22.81   14.04       56.14     5.26
         Col Pct    7.14       17.11   12.12       22.54     18.75
2        Frequency  2          6       10          25        3            46
         Expected   2.051      11.134  9.6688      20.803    2.3439
         Percent    0.64       1.91    3.18        7.96      0.96         14.65
         Row Pct    4.35       13.04   21.74       54.35     6.52
         Col Pct    14.29      7.89    15.15       17.61     18.75
3        Frequency  4          19      15          14        2            54
         Expected   2.4076     13.07   11.35       24.42     2.7516
         Percent    1.27       6.05    4.78        4.46      0.64         17.20
         Row Pct    7.41       35.19   27.78       25.93     3.70
         Col Pct    28.57      25.00   22.73       9.86      12.50
4        Frequency  7          38      33          71        8            157
         Expected   7          38      33          71        8
         Percent    2.23       12.10   10.51       22.61     2.55         50.00
         Row Pct    4.46       24.20   21.02       45.22     5.10
         Col Pct    50.00      50.00   50.00       50.00     50.00
Total    Frequency  14         76      66          142       16           314
         Percent    4.46       24.20   21.02       45.22     5.10         100.00



Chi-Square 12 16.7882 0.1577

Likelihood Ratio Chi-Square 12 17.7738 0.1227




Cramer's V 0.1335



The results show that there is no statistically significant relationship between cluster and the type of coffee preferred (chi-square with 12 degrees of freedom = 16.79, p = 0.1577). Again, SAS warns that 30% of the cells have expected counts less than 5 and that chi-square may not be a valid test. For further calculations Fisher's exact test should be used.

On the other hand, we see that Espresso is by far the most popular type of coffee (45%); respondents also choose Caffee Latte (24%) and Cappuccino (21%), while the other choices are marginal.

Cluster 1, judging by the frequencies alone, is fonder of Espresso, at 56% instead of the expected 45%. Its members also like Caffee Latte (23%), but choose Cappuccino less often (14% instead of 21%).

Cluster 2 respondents are also Espresso drinkers: 54% against the expected 45%. Their second choice is Cappuccino (22%), while Caffee Latte is less popular here (13% instead of 24%).

Cluster 3 is clearly made up of Caffee Latte drinkers (35% instead of 24%). Their second choice is Cappuccino (28% instead of 21%) and their third is Espresso, whose frequency is much lower than expected (26% instead of 45%).

9.1.7 Cluster x Times Per Day

In this part we evaluate each cluster against the number of times per day (m_3) respondents drink coffee.

proc freq data=Coffee.compare3;
   table cluster*m_3 / all expected;
run;

Cluster  Statistic  1 or less  2       3       >3      Total
1        Frequency  13         20      11      13      57
         Expected   17.427     21.783  8.7134  9.0764
         Percent    4.14       6.37    3.50    4.14    18.15
         Row Pct    22.81      35.09   19.30   22.81
         Col Pct    13.54      16.67   22.92   26.00
2        Frequency  12         22      6       6       46
         Expected   14.064     17.58   7.0318  7.3248
         Percent    3.82       7.01    1.91    1.91    14.65
         Row Pct    26.09      47.83   13.04   13.04
         Col Pct    12.50      18.33   12.50   12.00
3        Frequency  23         18      7       6       54
         Expected   16.51      20.637  8.2548  8.5987
         Percent    7.32       5.73    2.23    1.91    17.20
         Row Pct    42.59      33.33   12.96   11.11
         Col Pct    23.96      15.00   14.58   12.00
4        Frequency  48         60      24      25      157
         Expected   48         60      24      25
         Percent    15.29      19.11   7.64    7.96    50.00
         Row Pct    30.57      38.22   15.29   15.92
         Col Pct    50.00      50.00   50.00   50.00
Total    Frequency  96         120     48      50      314
         Percent    30.57      38.22   15.29   15.92   100.00



Chi-Square 9 9.2367 0.4157





Cramer's V 0.0990

These results support the null hypothesis H0 that there is no statistically significant relationship between cluster and the number of times a day coffee is drunk (chi-square with 9 degrees of freedom = 9.2367, p = 0.4157). The p-value of 0.42 is far above 0.05. There are no very notable differences when analyzing the clusters and the answers one by one. The only remark is that in Cluster 3 respondents choose drinking coffee once a day or less more often than expected: 43% instead of the expected 31%. Knowing the characteristics of the clusters, we can say that these respondents probably associate coffee with socializing and dessert rather than with an everyday routine. In Cluster 1 and Cluster 2 respondents most often choose coffee twice a day; in Cluster 3, too, the twice-a-day choice is frequent.

9.1.8 Cluster x Way of Drinking Coffee

This time we compare cluster and the way of drinking coffee (m_4), using the program:

proc freq data=Coffee.compare3;
   table cluster*m_4 / all expected;
run;

Cluster  Statistic  Bar     Take Away  Sitting in Cafe  Total
1        Frequency  11      11         35               57
         Expected   13.433  11.618     31.949
         Percent    3.50    3.50       11.15            18.15
         Row Pct    19.30   19.30      61.40
         Col Pct    14.86   17.19      19.89
2        Frequency  18      9          19               46
         Expected   10.841  9.3758     25.783
         Percent    5.73    2.87       6.05             14.65
         Row Pct    39.13   19.57      41.30
         Col Pct    24.32   14.06      10.80
3        Frequency  8       12         34               54
         Expected   12.726  11.006     30.268
         Percent    2.55    3.82       10.83            17.20
         Row Pct    14.81   22.22      62.96
         Col Pct    10.81   18.75      19.32
4        Frequency  37      32         88               157
         Expected   37      32         88
         Percent    11.78   10.19      28.03            50.00
         Row Pct    23.57   20.38      56.05
         Col Pct    50.00   50.00      50.00
Total    Frequency  74      64         176              314
         Percent    23.57   20.38      56.05            100.00



Chi-Square 6 9.5977 0.1426




Cramer's V 0.1236

The results support the null hypothesis H0 that there is no statistically significant relationship between cluster and the way coffee is drunk (chi-square with 6 degrees of freedom = 9.5977, p = 0.1426), although the p-value of 14% is not very high.

Analyzing the frequencies one by one, we notice some values different from those expected in Cluster 2, where respondents choose more often than expected to drink coffee quickly at the bar (39% instead of the expected 24%) and less often than expected to take a cup of coffee without hurry in a café (41% instead of 56%).

Overall, looking at the whole sample, sitting in a café and taking one's time is the most popular way to drink coffee (56%); take-away (20%) and fast coffee at the bar (24%) have more or less the same popularity.

9.2 Country x Variable

In this part we compare the variable country with the other qualitative variables related to personal coffee drinking habits (questions m_1-m_4). As mentioned before, the variable country is important for marketing decisions because of cultural differences.

9.2.1 Country x The Time of the Day

First we compare the variable country with the time of the day for drinking coffee (m_1), using the program:

proc freq data=Coffee.compare3;
   table country*m_1 / all expected;
run;

Country     Statistic  After   Afternoon  Does not    Evening  Morning  Usually when I     Total
                       meals              matter the                    meet other people
                                          time                          for this purpose
Italy       Frequency  20      4          34          2        42       0                  102
            Expected   11.045  5.1975     37.032      0.6497   36.382   11.694
            Percent    6.37    1.27       10.83       0.64     13.38    0.00               32.48
            Row Pct    19.61   3.92       33.33       1.96     41.18    0.00
            Col Pct    58.82   25.00      29.82       100.00   37.50    0.00
Lithuanian  Frequency  10      12         48          0        40       16                 126
            Expected   13.643  6.4204     45.745      0.8025   44.943   14.446
            Percent    3.18    3.82       15.29       0.00     12.74    5.10               40.13
            Row Pct    7.94    9.52       38.10       0.00     31.75    12.70
            Col Pct    29.41   75.00      42.11       0.00     35.71    44.44
Palestine   Frequency  4       0          32          0        30       20                 86
            Expected   9.3121  4.3822     31.223      0.5478   30.675   9.8599
            Percent    1.27    0.00       10.19       0.00     9.55     6.37               27.39
            Row Pct    4.65    0.00       37.21       0.00     34.88    23.26
            Col Pct    11.76   0.00       28.07       0.00     26.79    55.56
Total       Frequency  34      16         114         2        112      36                 314
            Percent    10.83   5.10       36.31       0.64     35.67    11.46              100.00

Statistics for Table of country by m_1



9.2.2 Country x Type of Coffee

Another variable to compare by country is the type of coffee (m_2).

proc freq data=Coffee.compare3;
   table country*m_2 / all expected;
run;

Country     Statistic  Americano  Caffee  Cappuccino  Espresso  Other, not   Total
                                  Latte                         traditional
                                                                types
Italy       Frequency  6          6       8           82        0            102
            Expected   4.5478     24.688  21.439      46.127    5.1975
            Percent    1.91       1.91    2.55        26.11     0.00         32.48
            Row Pct    5.88       5.88    7.84        80.39     0.00
            Col Pct    42.86      7.89    12.12       57.75     0.00
Lithuanian  Frequency  6          64      22          30        4            126
            Expected   5.6178     30.497  26.484      56.981    6.4204
            Percent    1.91       20.38   7.01        9.55      1.27         40.13
            Row Pct    4.76       50.79   17.46       23.81     3.17
            Col Pct    42.86      84.21   33.33       21.13     25.00
Palestine   Frequency  2          6       36          30        12           86
            Expected   3.8344     20.815  18.076      38.892    4.3822
            Percent    0.64       1.91    11.46       9.55      3.82         27.39
            Row Pct    2.33       6.98    41.86       34.88     13.95
            Col Pct    14.29      7.89    54.55       21.13     75.00
Total       Frequency  14         76      66          142       16           314
            Percent    4.46       24.20   21.02       45.22     5.10         100.00




There is definitely a statistically significant relationship between country and the type of coffee (chi-square with 8 degrees of freedom = 151.87, p =

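Country    Statistic  1 or less  2       3       >3      Total
Palestine  Frequency  30         24      18      14      86
           Expected   26.293     32.866  13.146  13.694
           Percent    9.55       7.64    5.73    4.46    27.39
           Row Pct    34.88      27.91   20.93   16.28
           Col Pct    31.25      20.00   37.50   28.00
Total      Frequency  96         120     48      50      314
           Percent    30.57      38.22   15.29   15.92   100.00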




9.2.4 Country x The Way of Drinking

Finally we compare country with the way of drinking coffee (m_4).

proc freq data=Coffee.compare3;
   table country*m_4 / all expected;
run;

Country     Statistic  Bar     Take Away  Sitting in Cafe  Total
Italy       Frequency  66      12         24               102
            Expected   24.038  20.79      57.172
            Percent    21.02   3.82       7.64             32.48
            Row Pct    64.71   11.76      23.53
            Col Pct    89.19   18.75      13.64
Lithuanian  Frequency  4       36         86               126
            Expected   29.694  25.682     70.624
            Percent    1.27    11.46      27.39            40.13
            Row Pct    3.17    28.57      68.25
            Col Pct    5.41    56.25      48.86
Palestine   Frequency  4       16         66               86
            Expected   20.268  17.529     48.204
            Percent    1.27    5.10       21.02            27.39
            Row Pct    4.65    18.60      76.74
            Col Pct    5.41    25.00      37.50
Total       Frequency  74      64         176              314
            Percent    23.57   20.38      56.05            100.00




of 56%); moreover, take-away is not so common in this culture (12% instead of 20%).

Lithuanians, on the contrary, like more than expected to take their time over a cup of coffee (68% instead of 56%). Otherwise, to save time, Lithuanians take their cup of coffee away (29%). They do not have the habit of drinking coffee in a hurry at the bar (3% instead of the expected 24%).

Palestinians are more similar to Lithuanians than to Italians. First of all, they prefer taking their time over a cup of coffee (77% instead of the expected 56%). 19% of Palestinians like take-away coffee. On the other hand, only a few of them (5% instead of the expected 24%) favour a fast coffee at the bar.

9.2.5 The Most Significant Features of The Countries

After comparing the three countries on the different variables, we can recognize some clear features of, and differences between, Italy, Lithuania and Palestine.

9.2.5.1 Italy

Italians have specific coffee drinking traditions. They usually drink coffee twice: in the morning and after meals, or at any other time of the day. Italians prefer Espresso, most likely taken fast, at the bar. These are the most significant features of the Italian respondents.

9.2.5.2 Lithuania

Lithuanians drink coffee once or twice a day, usually in the morning and then at any other time of the day, often with the purpose of socializing. Although Espresso is the most popular coffee overall, Lithuanians are real Caffee Latte lovers. Moreover, they enjoy sitting in the café and taking their time.

9.2.5.3 Palestine

Palestinians are more similar to Lithuanians than to Italians. They usually drink coffee in the morning, and otherwise not at a fixed time of day but for the purpose of meeting people and socializing. Palestinians drink coffee either quite rarely, once a day or less, or three or more times a day. They have a strong preference for Cappuccino; moreover, as everywhere, Espresso is also important. More than the other groups, Palestinians like different, non-traditional tastes of coffee. Like Lithuanians, they have a strong preference for taking their time over coffee in a café.

10 Strategic Decisions

Finally we come to the conclusions, where we define the different groups with their particular characteristics and habits, which tell us what kind of café would be popular with each group. First we determine the most significant features of each country. Then we add these features when making the strategic marketing decision about what kind of café to open in each area.

10.1 Cluster 1- Sophisticated Coffee and Cigarettes

As we saw above, Cluster 1 is a group of people who enjoy smoking, a good interior and atmosphere, and socializing while drinking coffee. Respondents with these characteristics are spread across all three countries, without any significant differences from the expected frequencies. The cluster has one distinctive feature: most of its respondents are smokers. Moreover, as in the sample as a whole, people in this cluster mostly choose to take their time over coffee.

For this group of respondents fashionable cafés should be opened. The greatest attention should be paid to creating an interior and a cozy atmosphere that invite customers to stay longer. Even where smoking inside is forbidden, a smoking area should be available, with heaters for winter (especially in Lithuania). The cafés should be situated in locations that are convenient for meetings. Since Italians tend to drink coffee fast, at the bar, the café in Italy should have fewer places to sit and pay more attention to an attractive bar where customers can take a quick coffee on the way, or with a cigarette outside. In the other countries a bar is not necessary; more attention must be paid to providing enough places to sit. The menu in a sophisticated place should include the traditional types of coffee.

10.2 Cluster 2- Fast Coffee

Cluster 2 describes people whose preference is good-quality strong coffee, mostly Espresso, and also Ice Coffee (we assume in the summer season). These people do not like to spend a lot of time on the process; they prefer fast coffee. Moreover, Cluster 2 has a higher frequency of Italians than expected.

The café that satisfies the needs of the group described by Cluster 2 should be a simple coffee bar in convenient locations: next to offices, city-centre shopping areas, universities and lunch restaurants. Good-quality strong coffee and a choice of ice coffee in summer are essential features. No investment should be made in extended menus with non-traditional tastes of coffee.

Knowing the coffee drinking habits in Italy and the fact that the cluster has a high share of Italians, these coffee bars should be opened in Italy first. Since there is strong competition from places with a similar concept in this country, we would compete on good-quality coffee, convenient locations and fast service, or the possibility of self-service, to make the process less time-consuming.

10.3 Cluster 3- Sweet Break or Take-Away

Cluster 3 describes people, who like meeting other people, and while socializing, having a cup

of coffee, with different tastes, or a dessert next to eat. Their preferred type of coffee in Caffe

Latte and probably its variations. These respondents also choose Take-Away coffee. This way

of coffee relating with socializing and dessert is more common to Lithuanians.

Here a Starbucks-style coffee place should be opened. The café would offer an extended menu

of different flavours. Most of the attention should be paid to Caffe Latte with different syrups.

An attractive dessert menu should be available. The place should be cozy enough to spend some time

there, but simple, keeping the prices low.

Given the features of the three countries and the fact that this cluster is more common among

Lithuanians, we would open cafes of this concept in Lithuania first. Palestine should not be

forgotten either, as respondents there express a preference for different types of coffee, drinking it many times

a day and taking their time over the process. For the moment we should not invest in opening

such a place in Italy, as people there have slightly different habits.

11 Appendix

Appendix 1. Questionnaire.

Coffee drinking habits in Palestine, Lithuania and Italy

We are conducting research on cafe preferences and coffee drinking habits in three different

cultures, with the purpose of finding the best cafe concept for each location. The questionnaire is

therefore about your coffee drinking habits in cafes, not at home (if, for example, you drink coffee every

morning at home and later in a cafe with other people, please base your answers on the latter).

Please keep that in mind when answering the questions. If you do not drink coffee, do not fill in the

questionnaire. We kindly ask you to fill in the questionnaire only if you are originally from Lithuania,

Italy or Palestine. For each multiple-choice question, choose just the one best answer. For scale-type

questions, evaluate the statement or answer the question, where 1 is the most negative answer and 10 is the

most positive. The survey is completely anonymous.

1. Choose one favourite type of coffee, which you usually drink, from the list below: Espresso; Americano; Cappuccino; Caffe Latte; Other, non-traditional types

2. When do you usually drink coffee? Morning; Afternoon; Evening; After meals; The time does not matter; Usually when I meet other people for this purpose

3. How many times a day do you usually drink coffee? 1 or less; 2; 3

Working person; Unemployed; Other

18. Are you a smoker? No; Yes
