bi canetto
8/2/2019 BI Canetto
1/56
CLAMDA - INTERNATIONAL MANAGEMENT
FACULTY OF ECONOMICS
Coffee Drinking Habits in Lithuania, Palestine and Italy
Business Intelligence Written Assignment
Kotryna Garsvaite
Mahran Sharqawi
Marcello Canetto
Professor Furio Camillo
CONTENT
Introduction
1 The Research
2 Importing Data
3 Simple Statistics
4 Principal Component Analysis
5 Size Effects Removal
5.1 Principal Component Analysis after Size Effect Removal
6 The Cluster Analysis
7 Ward's Method
7.1 Dendrogram - Graphical Representation
8 T-TEST Procedure
8.1 Preparation of the dataset
8.2 T-Test for Respondents
8.3 Cluster 1
8.4 Cluster 2
8.5 Cluster 3
9 Proc Freq Procedure (Chi Square Test)
9.1 Cluster x Variable
9.2 Country x Variable
10 Strategic Decisions
10.1 Cluster 1 - Sophisticated Coffee and Cigarettes
10.2 Cluster 2 - Fast Coffee
10.3 Cluster 3 - Sweet Break or Take-Away
11 Appendix
Introduction
The energizing effect of the coffee bean plant is thought to have been discovered in the northeast region of Ethiopia, and the cultivation of coffee first expanded in the Arab world. The earliest credible evidence of coffee drinking appears in the middle of the 15th century, in the Sufi monasteries of Yemen in southern Arabia. From the Muslim world, coffee spread to Italy, then to the rest of Europe.
Coffee has been a popular drink for many centuries. Searching through history for the roots of this amazing drink leads to many stories and, we may say, legends. From South American countries such as Brazil, through the northeast region of Ethiopia, to the Arabian Peninsula and on to Europe, people have used coffee for the stimulating effect of its caffeine content.
Because of the popularity and attractiveness of coffee, and the possibility of distributing the questionnaire in different countries, we chose Coffee Drinking Habits in Lithuania, Palestine and Italy as the subject of our research. Our goal is to find the best café concept for different groups of people in three different countries.
1 The Research
In our research we examined the habits of drinking coffee outside the home in three different countries: Italy, Lithuania and Palestine. To achieve this goal we formulated a questionnaire aimed at people in these three countries, which lie at different points on the map and have different climates and, of course, different cultures. Our purpose is to find the differences in coffee drinking habits among these three countries and, in addition, to find similarities between specific groups in these countries.
We are also aware of the wide range of respondents and of the effect of other factors on their answers. Coffee is perceived differently in these countries; still, worldwide chains such as Starbucks may have a similar effect on consumers in different places.
Through our questionnaire we tried to gather information about coffee drinking habits, such as the type of coffee preferred, the times at which people prefer to drink coffee, and other factors that are important for people who drink their coffee outside.
The questionnaire was worded as clearly as possible so that everyone could understand the questions and answers without errors. Another feature is its simplicity: we structured the questions as simply as possible so that respondents could answer them in as little time as possible.
The result was a questionnaire of 18 questions (Appendix 1).
First we placed four qualitative questions: favorite type of coffee, frequency of drinking coffee, number of times a day, and the preferred way of drinking coffee outside.
Then we identified 9 factors that we consider important for understanding the reasons behind the respondents' decisions. We formulated 9 questions and asked respondents to rate each factor on a scale from 1 to 10, where 1 is the most negative (not important) evaluation and 10 the most positive (important).
These 9 factors are:
1. Interior and atmosphere of the place.
2. Socializing with people.
3. Effect of caffeine.
4. Traditional tastes of coffee.
5. Importance of the price.
6. Smoking.
7. "Take-away culture".
8. Ice coffee.
9. Dessert/croissant.
As our target audience was spread over three different countries and was hard to reach physically to hand out the questionnaire, we reached them through the internet. We published the questionnaire online for four days.
We also placed the generic questions at the end of the questionnaire in order to identify our respondents.
In the end our sample consisted of 157 useful observations, around 40-60 from each country. The most active respondents were Lithuanians, the least active Palestinians.
The data were originally cataloged in Microsoft Excel and later imported into SAS software.
2 Importing Data
The first step was to import the data from Excel to SAS using the Import Wizard.
Then we renamed and labeled the questions as follows:
ID id
1. Question m_1 (type of coffee): label type
2. Question m_2 (time of drinking): label time
3. Question m_3 (times drinking): label times a day
4. Question m_4 (preferred drinking way): label way to drink
5. Next we placed 9 sub-questions which asked respondents about the importance of different factors in their choice.
6. Question s_1 (interior and atmosphere): label interior
7. Question s_2 (socializing with people): label socializing
8. Question s_3 (effect of caffeine): label caffeine
9. Question s_4 (traditional tastes): label tastes
10. Question s_5 (importance of price): label price
11. Question s_6 (relating to smoking): label smoking
12. Question s_7 (take-away culture): label take away
13. Question s_8 (ice coffee): label ice coffee
14. Question s_9 (dessert/croissant): label dessert
15. Question Country: label country
16. Question Gender: label gender
17. Question Age: label age
18. Question 8 Occupation: label occupation
19. Question 9 Smoking: label smoker
To do so, we used the following commands in SAS:
/* Overwrite the dataset in place, attaching a descriptive label to each variable */
data Coffee.Coffee;
set Coffee.Coffee;
label id='id'
m_1='type'
m_2='time'
m_3='times a day'
m_4='way to drink'
s_1='interior'
s_2='socializing'
s_3='caffeine'
s_4='tastes'
s_5='price'
s_6='smoking'
s_7='take away'
s_8='ice coffee'
s_9='dessert'
country='country'
gender='gender'
age='age'
occupation='occupation'
smoker='smoker';
run;
Our data was ready for the analysis of the values in the respective tables.
3 Simple Statistics
Once our data was imported, sorted and renamed, we started with the first data analysis.
The first procedure we used is PROC MEANS. To do this we gave SAS the command:
proc means data=Coffee.Coffee n mean stddev stderr median cv;
var s_1-s_9;
run;
With this procedure we obtain the number of observations (n), the average (mean), the standard deviation (stddev), the standard error (stderr), the median (median) and the coefficient of variation (cv).
The MEANS Procedure
Variable  Label        N    Mean       Std Dev    Std Error  Median     Coeff of Variation
s_1       interior     157  6.8343949  2.4543189  0.1958760  8.0000000  35.9112829
s_2       socializing  157  6.3312102  2.4215606  0.1932616  7.0000000  38.2479888
s_3       caffeine     157  6.5796178  2.6339193  0.2102096  7.0000000  40.0314930
s_4       tastes       157  4.1528662  3.0237640  0.2413226  3.0000000  72.8114953
s_5       price        157  5.3630573  2.6485431  0.2113767  5.0000000  49.3849478
s_6       smoking      143  4.7482517  3.8409736  0.3211983  3.0000000  80.8923740
s_7       take away    157  6.1401274  2.8699562  0.2290474  6.0000000  46.7409882
s_8       ice coffee   157  5.5732484  3.1136668  0.2484977  6.0000000  55.8680781
s_9       dessert      157  5.3503185  2.7939179  0.2229789  5.0000000  52.2196568
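The columns of this table are linked by simple formulas: the standard error is the standard deviation divided by the square root of N, and the coefficient of variation is the standard deviation as a percentage of the mean. The following is a minimal Python sketch of those relations, not part of the SAS program; the function name summarize and the list ratings are hypothetical, and only the s_1 figures are taken from the table.

```python
import math
import statistics

def summarize(values):
    """Return the statistics requested from PROC MEANS, for one variable."""
    n = len(values)
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)   # sample standard deviation (n - 1)
    std_err = std_dev / math.sqrt(n)     # standard error of the mean
    cv = 100 * std_dev / mean            # coefficient of variation, in percent
    return {"n": n, "mean": mean, "std_dev": std_dev, "std_err": std_err,
            "median": statistics.median(values), "cv": cv}

# Hypothetical ratings on the 1-10 scale, not taken from the survey.
ratings = [8, 7, 9, 4, 6, 8, 10, 5]
stats = summarize(ratings)

# Cross-check of the published row for s_1 (interior).
s1_std_err = 2.4543189 / math.sqrt(157)
s1_cv = 100 * 2.4543189 / 6.8343949
```

The two cross-check values agree with the Std Error and Coeff of Variation columns for s_1 up to rounding.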
In this table we marked the highest means with the lowest standard deviations, which give us a clear perspective on the important variables in our research. For example, the high value of the Interior variable means that respondents give this factor high importance when choosing a place for drinking coffee. Respondents also gave high importance to Socializing and to the Take-Away option. This tells us that, for those who choose to drink coffee outside, three of the important factors are connected not to coffee itself but to the habit of drinking coffee. Another important fact is that Caffeine has one of the four highest mean values, which tells us that some respondents relate coffee to its primary feature, caffeine.
4 Principal Component Analysis
In this part we try to find possible relationships between the different variables. In other words, we want to see if there is any relationship between the different possible answers to the questionnaire. This requires a multivariate analysis of the responses, which is done through principal component analysis.
In SAS we use the program:
proc princomp data=Coffee.Coffee;
var s_1-s_9;
run;
In this way we obtain 3 useful results to analyze:
1) Correlation coefficients
Correlation Matrix
                  s_1     s_2     s_3     s_4     s_5     s_6     s_7     s_8     s_9
s_1  interior     1.0000  0.5319  -.1563  0.2340  0.1566  -.0505  0.3069  0.0336  0.1214
s_2  socializing  0.5319  1.0000  -.0171  0.3303  0.1707  -.0478  0.3081  0.0174  0.1539
s_3  caffeine     -.1563  -.0171  1.0000  0.0022  0.1269  -.0377  -.0592  0.0135  0.1890
s_4  tastes       0.2340  0.3303  0.0022  1.0000  0.1767  -.1145  0.3631  0.1942  0.2501
s_5  price        0.1566  0.1707  0.1269  0.1767  1.0000  0.1137  -.0102  0.0520  0.0889
s_6  smoking      -.0505  -.0478  -.0377  -.1145  0.1137  1.0000  -.0909  -.0998  -.1030
s_7  take away    0.3069  0.3081  -.0592  0.3631  -.0102  -.0909  1.0000  0.3366  0.1401
s_8  ice coffee   0.0336  0.0174  0.0135  0.1942  0.0520  -.0998  0.3366  1.0000  0.2037
s_9  dessert      0.1214  0.1539  0.1890  0.2501  0.0889  -.1030  0.1401  0.2037  1.0000
The first observation concerning the correlation coefficients is that they mix positive and negative values: the majority are positive, and only 11 of the 36 pairs of variables have a negative value. We highlighted the highest values in yellow and the lowest in light blue.
Among the positive values, the highest is 0.5319, the correlation between Socializing and Interior/Atmosphere. We can conclude that respondents who gave importance to Socializing with people while drinking coffee also gave great importance to the Interior of the café, and vice versa.
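To make the reading of these coefficients concrete, the following minimal Python sketch computes a Pearson correlation by hand. It is an illustration only: the function name pearson and the three short rating lists are hypothetical and are not survey data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings from six respondents (not survey data).
interior = [8, 3, 9, 5, 7, 2]
socializing = [7, 4, 9, 6, 6, 3]
smoking = [2, 9, 1, 6, 3, 8]

r_pos = pearson(interior, socializing)  # positive: the two factors are rated alike
r_neg = pearson(interior, smoking)      # negative: the two factors are rated oppositely
```

A value near +1 means that respondents who rate one factor highly also rate the other highly, a value near -1 means the opposite, and a value near 0 means no linear relationship.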
Even though only one correlation is above 0.5, we still consider values above 0.2 as high, since we are testing a wide range of respondents. In second place we find the correlation between Tastes and Take-Away, with a high value of 0.3631. We assume that respondents who like to take their coffee away with them care about the different tastes of coffee.
In third place is the correlation between Ice Coffee and Take-Away, with a value of 0.3366. It could be concluded from this that ice coffee lovers take it away. Moreover, Take-Away people, besides the different tastes mentioned before, also like Ice Coffee.
Another positive correlation is between Tastes and Dessert (0.2501): respondents who like different, probably sweet, tastes of coffee do not refuse dessert either.
The last high value in this table is the correlation between Tastes and Socializing (0.3303). It says that people like spending time with others in a coffee place that can offer various tastes. Furthermore, Tastes has another positive correlation, of 0.2340, with Interior.
The negative values, all small in magnitude, mark pairs of features that tend slightly in opposite directions (Caffeine and Socializing; Smoking with Tastes, Take-Away, Ice Coffee and Dessert).
These observations give us a first view of the trends in our research, which will be calculated more precisely later.
The values are not all positive, which reduces the possibility of the error called size effect. Still, the data must be corrected to check whether this error exists and how it might influence the results obtained. To do this we will use a procedure shown later.
Now we continue with the second part of the PRINCOMP analysis.
2) Correlation matrix eigenvalues
We noted the presence of positive correlations between the variables, but we also found negative correlations; still, we decided to try to eliminate the size effect. The PRINCOMP procedure in SAS generates new vectors that define a new system of independent, uncorrelated dimensions. Each principal component is a linear combination of the original variables, with coefficients equal to an eigenvector of the correlation matrix.
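The idea of a component as a linear combination can be illustrated with the smallest possible case. For two standardized variables with correlation r, the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r, with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2). The Python sketch below uses this closed form; it is a two-variable illustration, not the nine-variable SAS analysis, and the function names and the example respondent are hypothetical.

```python
import math

def two_variable_pca(r):
    """Eigenvalues and unit eigenvectors of the correlation matrix [[1, r], [r, 1]]."""
    s = 1 / math.sqrt(2)
    shared = (1 + r, (s, s))      # both variables weighted in the same direction
    contrast = (1 - r, (s, -s))   # one variable weighted against the other
    # Order by eigenvalue so the first component carries the most variance.
    return sorted([shared, contrast], key=lambda pair: pair[0], reverse=True)

def component_score(z, eigenvector):
    """Score = standardized values weighted by the eigenvector coefficients."""
    return sum(weight * value for weight, value in zip(eigenvector, z))

# r = 0.5319 is the Interior/Socializing correlation from the matrix above.
(ev1, vec1), (ev2, vec2) = two_variable_pca(0.5319)

# Hypothetical respondent, one standard deviation above the mean on both variables.
score = component_score((1.0, 1.0), vec1)
```

The two eigenvalues always add up to 2, the number of variables, which is why an eigenvalue above one marks a component that carries more variance than a single original variable.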
Eigenvalues of the Correlation Matrix
Eigenvalue Difference Proportion Cumulative
1 2.30618151 0.98494347 0.2562 0.2562
2 1.32123803 0.09821232 0.1468 0.4030
3 1.22302572 0.23309129 0.1359 0.5389
4 0.98993443 0.19748232 0.1100 0.6489
5 0.79245210 0.05262728 0.0881 0.7370
6 0.73982482 0.03803349 0.0822 0.8192
7 0.70179132 0.20455790 0.0780 0.8972
8 0.49723342 0.06891478 0.0552 0.9524
9 0.42831865 0.0476 1.0000
The first column shows the eigenvalue of each principal component.
We consider the eigenvalues to determine the importance of the principal components. The first 3 eigenvalues are greater than one and are therefore the most significant. However, the first three alone explain only about 54% of the variance (cumulative proportion 0.5389).
The other components are less important, but each still represents roughly 5% or more of the variance. Our target is not specific, so in order not to lose information we consider all 9 principal components, which lets us explain the total variance.
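The Proportion and Cumulative columns can be recomputed directly from the eigenvalues, since the eigenvalues of a correlation matrix add up to the number of variables. The following Python sketch is a check of the arithmetic only; the variable names are hypothetical and the eigenvalues are copied from the table above.

```python
eigenvalues = [2.30618151, 1.32123803, 1.22302572, 0.98993443, 0.79245210,
               0.73982482, 0.70179132, 0.49723342, 0.42831865]

total = sum(eigenvalues)   # equals the number of variables, 9
proportions = [ev / total for ev in eigenvalues]

# Running total of the proportions, as in the Cumulative column.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

# Eigenvalue-greater-than-one rule: count the components above 1.
retained = sum(1 for ev in eigenvalues if ev > 1)
```

With these figures the rule retains three components, and the third entry of cumulative reproduces the 0.5389 shown in the table.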
3) Eigenvectors
Prin1 Prin2 Prin3 Prin4
s_1 interior 0.434038 -.417329 0.046529 -.169922
s_2 socializing 0.461075 -.313510 0.160018 -.240113
s_3 caffeine -.010706 0.517591 0.462081 -.245532
s_4 tastes 0.447888 0.103031 0.018111 0.051922
s_5 price 0.187179 -.016211 0.637148 0.248980
s_6 smoking -.129733 -.273119 0.374877 0.665444
s_7 take away 0.441400 0.033958 -.319502 0.240765
s_8 ice coffee 0.260091 0.421480 -.287120 0.516211
s_9 dessert 0.289749 0.442015 0.165447 -.14574
Prin5 Prin6 Prin7 Prin8 Prin9
s_1 0.079371 -.036478 0.389923 -.050380 0.666483
s_2 0.143574 0.169851 0.091209 0.437747 -.597040
s_3 0.180006 0.609017 0.062985 0.064716 0.216151
s_4 -.151813 -.070963 -.787577 0.244862 0.278280
s_5 -.606727 -.119484 0.125885 -.279837 -.142138
s_6 0.546455 0.030417 -.112881 0.092286 0.066560
s_7 0.141779 0.417523 -.065994 -.640667 -.186355
s_8 -.155812 0.007970 0.418987 0.450530 0.054111
s_9 0.454455 -.635840 0.083081 -.203562 -.113544
In the first column, Prin1, we count 7 variables with positive coefficients and 2 with negative coefficients. This shows that there is no strong size effect. However, before further considerations, we will remove the size effect to improve the result of our analysis.
5 Size Effects Removal
Size effects removal is motivated by the fact that the values given to the factors in the questionnaire depend on the average level of each person's judgments. These levels can differ greatly: we find very low values across all answers from people who have a more pessimistic view, and higher values from those who are accustomed to giving high ratings to the different parameters (optimistic).
The size effects are removed through a standardization procedure in SAS. We start by creating 9 new variables (n_1-n_9), which represent the new values of the 9 scale questions. These values are centered on the average value of each individual. SAS calculates the maximum, the minimum and the average value for each individual. The software then standardizes the answers to a range between -1 and +1, with the average value represented by 0.
We gave SAS the following command:
data Coffee.Coffee_1;
set Coffee.Coffee;
if _n_
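The SAS listing is cut off in this copy, so the exact formula is not shown. One transformation consistent with the description (each respondent's own mean mapped to 0, minimum to -1 and maximum to +1) can be sketched in Python as follows. The piecewise formula, the function name remove_size_effect and the two example respondents are assumptions for illustration, not the authors' code.

```python
def remove_size_effect(answers):
    """Rescale one respondent's ratings to [-1, 1] with their own mean at 0."""
    low, high = min(answers), max(answers)
    mean = sum(answers) / len(answers)
    if low == high:
        # The respondent gave the same rating everywhere: no contrast to express.
        return [0.0 for _ in answers]
    rescaled = []
    for value in answers:
        if value >= mean:
            rescaled.append((value - mean) / (high - mean))  # mean..max -> 0..1
        else:
            rescaled.append((value - mean) / (mean - low))   # min..mean -> -1..0
    return rescaled

# Hypothetical respondents with the same preferences but different rating levels.
pessimist = remove_size_effect([1, 2, 3, 2, 1, 4, 2, 1, 2])
optimist = remove_size_effect([7, 8, 9, 8, 7, 10, 8, 7, 8])
```

Under this assumed formula the two respondents end up with identical rescaled answers, which is the purpose of the correction: the general tendency to rate high or low disappears and only the relative preferences remain.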
Now, using our new dataset, we repeat the initial procedures to check the actual difference between the original dataset and the corrected one.
As before, we use PROC MEANS, so in our SAS program we write:
proc means data=Coffee.Coffee_1;
var n_1-n_9;
run;
The result is:
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
n_1 157 0.3196513 0.5870416 -1.0000000 1.0000000
n_2 157 0.1830072 0.5670946 -1.0000000 1.0000000
n_3 157 0.2677898 0.6738488 -1.0000000 1.0000000
n_4 157 -0.3539103 0.6861465 -1.0000000 1.0000000
n_5 157 -0.0736206 0.6409227 -1.0000000 1.0000000
n_6 157 -0.1591141 0.8576755 -1.0000000 1.0000000
n_7 157 0.1395236 0.6865145 -1.0000000 1.0000000
n_8 157 0.0133477 0.7477820 -1.0000000 1.0000000
n_9 157 -0.0531759 0.6824755 -1.0000000 1.0000000
First we notice that after the standardization the values lie between -1 and +1.
The standardization doesn't show surprising effects. In fact the three factors with the highest values didn't change: Interior, Socializing and Caffeine. However, the factor Take-Away, which is still in fourth place, lost its decisive role and loses importance in the analysis.
Looking at the negative signs (highlighted in blue), the most negative are, in order: Tastes, Smoking and Price. The trend is again similar to the one seen before size effect removal.
5.1 Principal Component Analysis after Size Effect Removal
After size effects removal we can repeat the principal component procedure using the new, more precise dataset.
In the SAS program we write:
proc princomp data=Coffee.Coffee_1 out=Coffee.cluster;
var n:; /* all variables whose names begin with n */
run;
Correlation Matrix
n_1 n_2 n_3 n_4 n_5 n_6 n_7 n_8 n_9
n_1 1.0000 0.3349 -.1629 -.0016 0.0090 -.1434 0.1069 -.1875 -.0245
n_2 0.3349 1.0000 -.1091 0.1039 0.0056 -.2060 0.0633 -.2269 -.0505
n_3 -.1629 -.1091 1.0000 -.2004 0.0376 -.1189 -.2219 -.0620 0.0918
n_4 -.0016 0.1039 -.2004 1.0000 -.0055 -.2079 0.1704 0.0148 0.1436
n_5 0.0090 0.0056 0.0376 -.0055 1.0000 0.0012 -.1916 -.1977 -.0554
n_6 -.1434 -.2060 -.1189 -.2079 0.0012 1.0000 -.2312 -.1898 -.1804
n_7 0.1069 0.0633 -.2219 0.1704 -.1916 -.2312 1.0000 0.1870 -.0672
n_8 -.1875 -.2269 -.0620 0.0148 -.1977 -.1898 0.1870 1.0000 0.0374
n_9 -.0245 -.0505 0.0918 0.1436 -.0554 -.1804 -.0672 0.0374 1.0000
This correlation matrix shows some differences with respect to the previous one, obtained without size effects removal.
The highest correlation (0.3349) is between Socializing and Interior. The second highest (0.1870) pairs Ice Coffee with Take-Away. The third (0.1704) pairs Take-Away with different Tastes.
Analysing the correlations, we can argue that the most correlated factors are similar or logically connected: if a person links coffee with socializing with other people, then he will choose a café with a nice interior in which to spend time with another person.
Further, Ice Coffee is a long drink that takes time to finish, so people can take it away and drink it slowly.
Furthermore, the negative correlations let us draw some important conclusions. Firstly, the strongest negative correlation (-0.2312) shows that people do not link Smoking with Take-Away. The negative correlation of -0.2269 likewise shows that people do not relate Ice Coffee to Socializing.
Eigenvalues of the Correlation Matrix
Eigenvalue Difference Proportion Cumulative
1 1.73942362 0.19850007 0.1933 0.1933
2 1.54092355 0.26656185 0.1712 0.3645
3 1.27436170 0.23053503 0.1416 0.5061
4 1.04382667 0.11817241 0.1160 0.6221
5 0.92565426 0.15262272 0.1029 0.7249
6 0.77303154 0.09324188 0.0859 0.8108
7 0.67978966 0.09613972 0.0755 0.8863
8 0.58364994 0.14431088 0.0648 0.9512
9 0.43933906 0.0488 1.0000
The new correlation matrix has 4 principal components with eigenvalues greater than one that we would want to consider. However, they explain only 62% of the information, which would lead us to consider at least 5 components.
After some consideration we decided to keep all 9 principal components for our analysis, because our sample is not homogeneous and specific. It is composed of people from three different countries and therefore from diverse cultures.
We need all the principal components because we require specific and detailed data. We want a macro view of the customer, as the scope of this research is to make general macro-marketing and strategic decisions about opening cafés in different countries.
Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
n_1 0.371810 0.427576 -.021365 -.273788 -.143199 0.477643 0.150053 -.541602 0.197539
n_2 0.394858 0.444974 0.100454 -.181705 -.054699 -.283115 -.471510 0.454429 0.301668
n_3 -.324870 -.052409 0.533502 -.428624 0.124845 -.367410 0.230515 -.208916 0.419537
n_4 0.389770 -.104202 0.143501 0.644117 -.002649 -.407235 -.032695 -.437329 0.208731
n_5 -.165954 0.338640 0.207353 0.394213 0.673584 0.353772 0.084423 0.177793 0.204390
n_6 -.429974 0.152762 -.489183 0.222761 -.358620 -.021224 0.084132 0.051592 0.603378
n_7 0.469938 -.245918 -.263986 -.124679 0.197221 -.081206 0.649893 0.339844 0.223837
n_8 0.096979 -.607624 -.094785 -.145799 0.226310 0.331441 -.494236 -.086094 0.422368
n_9 0.075611 -.194617 0.568567 0.227133 -.537096 0.385772 0.141847 0.328755 0.126714
Principal component analysis is a method to reduce the dimensionality of data. When two or more variables explain the same phenomenon, the purpose is to summarize the information they share. So where we have correlated variables we introduce principal components.
The statistical area Rp, where p is the number of variables, will be reduced in an area Rx with x
As we have seen, we have 9 theoretical PCs; as the order of extraction increases, the variance of the components decreases, which indicates that they lose importance (significance). This happens because PCs are built so that the first has maximum variance, with declining variance for those that follow. However, each PC brings new informative content. We decided to use all the principal components for our analysis because they all have a weight of roughly 5% or more and, as already mentioned, our sample is not homogeneous, so all the collected information is useful for our research.
6 The Cluster Analysis
The next step is to divide the sample into clusters, also called segments. The purpose of this analysis is to find groups of people whose characteristics are sufficiently similar with respect to one or more variables (in this case the segmentation is carried out on the scale questions of our questionnaire).
These groups should have a very small variance within them (homogeneity of the cluster) and, at the same time, a significant variance among them (the clusters must be as diverse as possible so we can give them a meaning).
In this way we should create homogeneous groups of potential customers that are easy to find and study, for example for marketing purposes.
We have to point out that the number of clusters is influenced by the purpose of the research: for general marketing and strategic choices the number of clusters should be low. We are trying to find the best concepts for cafés, and too many groups would be too varied to implement. Conversely, for micro-marketing and operational decisions the number of clusters should be very high to be effective.
The first step of the cluster analysis is the creation of a distance matrix in which the observations are the rows and columns, while the cells hold the measure of similarity or distance for each pair of observations. The distance matrix is an N x N matrix that relates the observations to each other, giving the relative distance and similarity between them.
Example of distance-matrix
a b c d e
a 0 16 1 9 10
b 16 0 17 25 2
c 1 17 0 4 9
d 9 25 4 0 13
e 10 2 9 13 0
There are various ways of measuring the distance between observations, the distance between clusters and their similarity. These measures are then used as selection criteria for deciding whether or not to merge different clusters: the distance measures how different the observations are. In hierarchical clustering the initial distances between observations arranged in the matrix are progressively reduced by merging the observations with the shortest distance (minimum distance fusion).
Under this method, in our example, customers a and c are the most similar (distance equal to 1). So customers a and c merge into a new unit f. The distance from this new unit to each other unit is the minimum of the distances from the two old units.
Following the example we have:
f b d e
f 0 16 4 9
b 16 0 25 2
d 4 25 0 13
e 9 2 13 0
Now the most similar customers are b and e, so they merge into a new unit g, and so on until only a single unit remains. This theory leads to Ward's method.
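The merging step of this example can be reproduced with a few lines of Python. This is a minimal sketch of minimum-distance fusion on the example matrix only; the function names closest_pair and merge are hypothetical, and it is not the procedure SAS runs.

```python
def closest_pair(dist):
    """Return the two distinct units with the smallest distance."""
    names = sorted(dist)
    pairs = [(dist[a][b], a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    _, a, b = min(pairs)
    return a, b

def merge(dist, a, b, new_name):
    """Merge units a and b; the new unit keeps the minimum distance to every other unit."""
    others = [u for u in dist if u not in (a, b)]
    merged = {u: {v: dist[u][v] for v in others} for u in others}
    merged[new_name] = {new_name: 0}
    for u in others:
        d = min(dist[a][u], dist[b][u])
        merged[new_name][u] = d
        merged[u][new_name] = d
    return merged

# The example distance matrix from the text.
labels = "abcde"
rows = [[0, 16, 1, 9, 10],
        [16, 0, 17, 25, 2],
        [1, 17, 0, 4, 9],
        [9, 25, 4, 0, 13],
        [10, 2, 9, 13, 0]]
dist = {r: dict(zip(labels, row)) for r, row in zip(labels, rows)}

first = closest_pair(dist)        # the pair merged first
step1 = merge(dist, *first, "f")  # reduced matrix after the first merge
second = closest_pair(step1)      # the pair merged next
```

Repeating closest_pair and merge until one unit is left gives the full merge history that a dendrogram displays.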
7 Ward's Method
In this analysis we use Ward's method, which creates groups by merging observations when the distance between the two is minimal. This distance is calculated as the sum of squared Euclidean distances, which in turn follow from the Pythagorean theorem.
Sum of squared errors:
$$\mathrm{SSE} = \sum_i \sum_j \sum_k \left( X_{i,jk} - Y_{i,jk} \right)^2$$
This method therefore tends to maximize the so-called variance between (among the different clusters) and to minimize the variance within (inside each cluster):
$$\mathrm{Var}(X) = \mathrm{Var}_{within} + \mathrm{Var}_{between}$$
$$\mathrm{Var}_{within} = \frac{\sum_i (X_{ip} - \bar{X}_p)^2}{N_p}$$
The sum of the squared differences between the value of the i-th unit and the average value of the group, divided by the size of the group.
$$\mathrm{Var}_{between} = \frac{\sum_p (\bar{X}_p - \bar{X})^2 N_p}{N_{tot}}$$
The sum of the squared differences between the average value of the group (cluster) and the total average value, each multiplied by the size of the group, divided by the whole population.
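The decomposition of the total variance can be checked numerically. The Python sketch below assumes that the within term is pooled over all clusters and divided by the total number of observations, which is the form under which the identity holds exactly; the function name and the three small clusters are hypothetical.

```python
def variance_decomposition(groups):
    """Split the total variance of 1-D data into within- and between-cluster parts."""
    values = [x for group in groups for x in group]
    n_total = len(values)
    grand_mean = sum(values) / n_total
    total = sum((x - grand_mean) ** 2 for x in values) / n_total
    within = 0.0
    between = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        # Squared deviations from the cluster's own mean.
        within += sum((x - group_mean) ** 2 for x in group) / n_total
        # Squared deviation of the cluster mean, weighted by cluster size.
        between += len(group) * (group_mean - grand_mean) ** 2 / n_total
    return total, within, between

# Hypothetical one-dimensional scores for three clusters.
clusters = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0], [4.0, 5.0]]
total, within, between = variance_decomposition(clusters)
```

A good partition is one where between is large relative to within, which is what Ward's method aims for at each merge.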
Below is the code we used in SAS to create the dataset Coffee.tree.
proc cluster data=Coffee.cluster method=ward outtree=Coffee.tree;
var Prin1-Prin9;
id id;
run;
Then we used this dataset to draw the dendrogram, which helped us find the number of significant clusters for our analysis.
7.1 Dendrogram - Graphical Representation
In a clustering procedure, the dendrogram provides a graphical representation of the process of grouping observations. It shows the relative distance at which the statistical units are merged. The graph is drawn in a Cartesian plane, with the X axis as the logical distance of the clusters according to the chosen measure and the Y axis as the hierarchical level of aggregation, or fusion distance.
The choice of the hierarchical level defines the number of clusters adopted for the analysis. The observations are aggregated according to their degree of proximity: the further apart the observations are on the X axis, the lower the possibility that they are aggregated into the same cluster.
However, this possibility also depends on the level of fusion distance accepted, since the higher the hierarchical level chosen, the lower the number of clusters found.
In our research, we use the dendrogram to identify the relevant number of clusters to examine.
It is now possible to create the dendrogram in SAS. After trying a few different numbers of clusters, we settled on 3. The result of the procedure is the graph below, cut so as to identify these 3 clusters. The cut was made at r = 0.06, indicating a fairly good level of accuracy. The dendrogram could be cut lower, giving a higher number of clusters, but for our strategic decisions, as mentioned before, the lower number serves better.
The SAS procedure used:
proc tree data=Coffee.tree nclusters=3 out=Coffee.parti3;
id id;
run;
[Figure: dendrogram of the 157 observations. Axes: semi-partial R-squared (0.000 to 0.100) and observation id.]
SAS Dendrogram. 3 Clusters
8 T-TEST Procedure
8.1 Preparation of the dataset
Having identified the clusters, we now want to understand the distinctive characteristics of each group.
To do this we created a generic cluster in which we included all the answers received.
This allows us to compare the mean scores of each cluster with the overall mean scores and to
understand which clusters differ from the general sample, and how.
8.2 T-Test for Respondents
To study the relationship between the quantitative variables and the segments obtained with
PROC CLUSTER, we can use the t-test.
For each variable tested, this test yields a value t and the associated probability that the
mean calculated for the cluster differs from the mean calculated on the entire sample only by
chance. The smaller this probability, indicated in SAS as Pr > |t|, the less likely it is that
the difference between the means is a random effect, and the more likely it is that the
variable is significant in explaining that cluster.
The significance level commonly used in the literature is 0.05. The null hypothesis of equal
means is therefore rejected, at the 95% confidence level, if the probability is less than 0.05.
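The t value reported in the output is calculated as $t = (\bar{x}_c - \bar{x}_{tot}) / s_c$,
where $\bar{x}_c$ is the cluster mean, $\bar{x}_{tot}$ is the mean of the whole sample and $s_c$
is the standard error, an estimate of the variability of the estimator of the mean.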
A note on the variance is needed: since the variance of the population is not available, an
estimator of the variance must be used. SAS calculates the t-test using two methods, which
differ only in how the variances are combined into the standard error of the test.
The Satterthwaite method calculates the standard error from the two variances (of the cluster
and of the whole sample) taken separately, each divided by its own sample size. This method
does not assume equality of variances and can be applied in all circumstances. The Pooled
method instead obtains the standard error from a single pooled variance, a weighted average of
the two variances, and therefore requires equal variances: it can only be applied in specific
circumstances, namely when the F-test for equality of variances does not reject the null
hypothesis. Considering that, when the variances are equal, the two methods produce
essentially the same value of t, it seems more efficient to use the Satterthwaite method.
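The difference between the two methods can be sketched in a few lines. The block below is only an illustration in Python, not the SAS implementation: the function names are hypothetical, the group sizes 57 and 157 are those of Cluster 1 and of the general cluster, and the means and variances are invented for the example.

```python
import math


def satterthwaite_t(mean_c, var_c, n_c, mean_tot, var_tot, n_tot):
    """Unequal-variance t statistic with Satterthwaite degrees of freedom."""
    a = var_c / n_c          # variance of the cluster mean
    b = var_tot / n_tot      # variance of the overall mean
    t = (mean_c - mean_tot) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n_c - 1) + b ** 2 / (n_tot - 1))
    return t, df


def pooled_t(mean_c, var_c, n_c, mean_tot, var_tot, n_tot):
    """Equal-variance t statistic based on the pooled variance."""
    df = n_c + n_tot - 2
    pooled_var = ((n_c - 1) * var_c + (n_tot - 1) * var_tot) / df
    t = (mean_c - mean_tot) / math.sqrt(pooled_var * (1 / n_c + 1 / n_tot))
    return t, df


# Hypothetical means and variances; only the group sizes come from the report.
t_s, df_s = satterthwaite_t(7.2, 4.0, 57, 6.1, 6.25, 157)
t_p, df_p = pooled_t(7.2, 4.0, 57, 6.1, 6.25, 157)
print(round(t_s, 2), round(df_s, 1), round(t_p, 2), df_p)
```

When the two sample variances are equal, both functions return the same t value and differ only in the degrees of freedom.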
Before further calculations, the data must be sorted and merged, and a general cluster 4
containing all respondents must be created, so that the data can be compared in the t-test:
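proc sort data=Coffee.parti3;
by id;
run;
/* attach the cluster assignment to each respondent */
data Coffee.compare;
merge Coffee.cluster
Coffee.parti3;
by id;
run;
/* copy of every respondent, relabelled as the general cluster 4 */
data Coffee.compare1;
set Coffee.compare;
cluster=4;
run;
/* clusters 1-3 stacked with the general cluster 4 */
data Coffee.compare3;
set Coffee.compare Coffee.compare1;
run;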
It is now possible to check, for each cluster, which variables describe its specificity, and
to describe the direction and strength of each relationship, if any.
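The effect of the data steps above can be illustrated outside SAS. The following Python sketch uses three invented respondents and a hypothetical helper name; it only mirrors the stacking logic, in which every respondent appears once in its own cluster and once more in the general cluster 4.

```python
def add_general_cluster(rows, general_label=4):
    """Return the original rows followed by a copy of each row relabelled as the general cluster."""
    copies = [dict(row, cluster=general_label) for row in rows]
    return rows + copies


# Invented respondents, for illustration only.
respondents = [
    {"id": 1, "cluster": 1, "n_1": 8},
    {"id": 2, "cluster": 3, "n_1": 5},
    {"id": 3, "cluster": 2, "n_1": 6},
]
stacked = add_general_cluster(respondents)
print(len(respondents), len(stacked))
```

This is why the stacked dataset has twice as many rows as there are respondents (314 rows for 157 respondents in the frequency tables of this report).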
8.3 Cluster 1
proc ttest data=Coffee.compare3;
var n_1-n_9;
class cluster;
where cluster=1 or cluster=4;
run;
8.3.1 Features:
Variable Method Variances DF t Value Pr > |t| Feature
1 Satterthwaite Unequal 131.32 3.24 0.0015 interior
2 Satterthwaite Unequal 108.6 1.60 0.1124 socializing
3 Satterthwaite Unequal 90.227 -2.45 0.0161 caffeine
4 Satterthwaite Unequal 97.436 -0.96 0.3406
5 Satterthwaite Unequal 108.28 -0.71 0.4784
6 Satterthwaite Unequal 123.69 5.53
This leads us to the conclusion that, if a café is opened, respondents in Cluster 1 place great
value on the possibility to smoke, the interior of the café and socializing. On the other hand,
when deciding where to drink coffee it is not important to them whether Ice Coffee is offered,
whether the coffee is strong or not, or whether take-away is possible.
8.4 Cluster 2
proc ttest data=Coffee.compare3;
var n_1-n_9;
class cluster;
where cluster=2 or cluster=4;
run;
8.4.1 Features:
Variable Method Variances DF t Value Pr > |t| Feature
1 Satterthwaite Unequal 69.053 -3.96 0.0002 interior
2 Satterthwaite Unequal 85.885 -5.26
8.5 Cluster 3
proc ttest data=Coffee.compare3;
var n_1-n_9;
class cluster;
where cluster=3 or cluster=4;
run;
8.5.1 Features:
Variable Method Variances DF t Value Pr > |t| Feature
1 Satterthwaite Unequal 108.85 1.19 0.2367
2 Satterthwaite Unequal 107.46 2.96 0.0037 socializing
3 Satterthwaite Unequal 104.54 1.02 0.3121
4 Satterthwaite Unequal 94.269 2.99 0.0036 tastes
5 Satterthwaite Unequal 89.599 1.77 0.0802 price
6 Satterthwaite Unequal 199.78 -10.11
[...] |t|).
Analyzing the t values, we find that, except for the variable Smoking, all the other variables
had a positive influence on the choice of the members of this group, which means they care
about these features when drinking coffee. Smoking, by contrast, is not important here.
We can see from the results that the variables Tastes, Socializing and Take-Away have the
biggest effect, and Dessert a slightly lower one. Among these five variables, Price had the
smallest effect. It is clear that we are dealing with people who like coffee of different
tastes, take it with dessert and associate it with socializing with other people.
9 Proc Freq Procedure (Chi Square Test)
After determining the three clusters and their particular characteristics, it is important to
compare each cluster with the qualitative characteristics in order to learn more about it.
We can also compare two qualitative variables with each other, to understand our sample
better. For this calculation we use the PROC FREQ procedure and the chi-square (CHISQ) test.
CHISQ provides chi-square tests of independence for each stratum and computes measures of
association. The chi-square test is used when one categorical variable (here the cluster) is
compared with another variable that takes two or more values (sex, country, age, etc.). The
observed counts of observations in each category are compared with the expected counts, which
are calculated from a theoretical expectation.
The null hypothesis is that the variables (cluster and country, age, etc.) are independent of
each other; the alternative hypothesis is that they are not independent, i.e. that they are
associated. At the level of each frequency, the statistical null hypothesis is that the number
of observations in each category is equal to that predicted, and the alternative hypothesis is
that the observed numbers differ from the expected ones. The test lets us retain or reject the
main hypothesis, that the clusters and the chosen variable are independent, and furthermore
lets us compare the frequencies of each group in each cluster.
The test statistic is calculated by taking each observed number (O), subtracting the expected
number (E), and squaring this difference. The larger the deviation from the null hypothesis,
the larger the difference between observed and expected. Squaring the differences makes them
all positive. Each squared difference is divided by the expected number, and these standardized
differences are summed: $\chi^2 = \sum (O - E)^2 / E$.
The shape of the chi-square distribution depends on the number of degrees of freedom. For an
extrinsic null hypothesis, the number of degrees of freedom is simply the number of values of
the variable, minus one. In a test with more than one nominal variable, the number of degrees
of freedom is equal to (number of rows − 1) × (number of columns − 1); in our case of a 4×3
table, there are (4−1)(3−1)=6 degrees of freedom.
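As a check on the description above, the statistic and its degrees of freedom can be recomputed by hand. The Python sketch below is an illustration, not part of the SAS workflow; the function name is hypothetical and the counts are the observed frequencies of the Cluster x Country table in section 9.1.1, with each expected count taken as row total × column total / grand total.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Observed counts (columns: Italy, Lithuania, Palestine; rows: clusters 1-4).
cluster_by_country = [
    [15, 24, 18],
    [23, 13, 10],
    [13, 26, 15],
    [51, 63, 43],
]
stat, df = chi_square(cluster_by_country)
print(round(stat, 4), df)
```

The result can be compared with the Chi-Square line of the SAS output for the same table (value 9.6280 with 6 degrees of freedom).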
In practice, the main hypotheses for evaluating each variable against the cluster are:
H0: the variables are independent;
H1: the variables are not independent.
We say that the variables are independent, and we retain hypothesis H0, when the chi-square
probability is p > 0.05 (we use the 95% confidence level by default); otherwise we reject H0
and accept H1.
9.1 Cluster x Variable
In this part of the analysis we compare each cluster with all the qualitative characteristics,
first those from the generic questions, then questions m_1-m_4. The only variable we do not
compare with the cluster is occupation, as it does not give much information, knowing that the
majority of respondents are students.
9.1.1 Cluster x Country
The biggest difference among the people who participated in the survey is their country, as
they have different cultures and habits. First we compare each cluster with the country where
the respondents live.
The SAS program we use is:
proc freq data=Coffee.compare3;
table cluster*country / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct Italy Lithuania Palestine Total
1 15 24 18 57
18.516 22.873 15.611
4.78 7.64 5.73 18.15
26.32 42.11 31.58
14.71 19.05 20.93
2 23 13 10 46
14.943 18.459 12.599
7.32 4.14 3.18 14.65
50.00 28.26 21.74
22.55 10.32 11.63
3 13 26 15 54
17.541 21.669 14.79
4.14 8.28 4.78 17.20
24.07 48.15 27.78
12.75 20.63 17.44
4 51 63 43 157
51 63 43
16.24 20.06 13.69 50.00
32.48 40.13 27.39
50.00 50.00 50.00
Total 102 126 86 314
32.48 40.13 27.39 100
Statistics for Table of CLUSTER by country
Statistic DF Value Prob
Chi-Square 6 9.6280 0.1412
Likelihood Ratio Chi-Square 6 9.3295 0.1559
Mantel-Haenszel Chi-Square 1 0.0052 0.9428
Phi Coefficient 0.1751
Contingency Coefficient 0.1725
Cramer's V 0.1238
We see that the chi-square probability is 0.1412 > 0.05 (chi-square value 9.628, with 6 degrees of freedom).
We retain the null hypothesis H0 and state that cluster and the variable country are independent.
On the other hand, the probability is only 14%, which is not very high, and the chi-square value
is above 9, not extremely low, so there might be some relationship. Analyzing the cells one by
one, we can find some differences in each cluster between the expected value and the observed
frequency:
Cluster 2 has a frequency of 23 Italians instead of the expected 15, which is 50% instead of
32%; this means that the second cluster has features more common to Italians. Moreover, the
same cluster has a somewhat lower than expected frequency of Lithuanians, 28% instead of 40%,
so these cluster characteristics are less common among Lithuanians. For Palestinians there is
no notable difference between the expected and observed frequency.
Cluster 3 does not show notable differences from the expected values. The frequency of Italians
is a little lower than expected, 24% instead of 32%, and that of Lithuanians a little higher,
48% instead of 40%.
All in all, Cluster 1 is common to all three countries. Cluster 2 is more suitable for Italians,
and Palestinians do not reject it either. Cluster 3 reflects Lithuanian habits more;
Palestinians do not reject it, while Italian preferences differ slightly here.
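The three association measures printed under the chi-square tests follow directly from the chi-square value and the table size. The sketch below, in Python with a hypothetical function name, uses the standard textbook definitions and the figures of the Cluster x Country table (chi-square 9.6280, 314 rows of data, 4 × 3 table); it is an independent check rather than the SAS code.

```python
import math


def association_measures(chi2, n, n_rows, n_cols):
    """Phi coefficient, contingency coefficient and Cramer's V from a chi-square value."""
    phi = math.sqrt(chi2 / n)
    contingency = math.sqrt(chi2 / (chi2 + n))
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return phi, contingency, cramers_v


phi, contingency, cramers_v = association_measures(9.6280, 314, 4, 3)
print(round(phi, 4), round(contingency, 4), round(cramers_v, 4))
```

All three measures are close to zero for this table, which is consistent with the weak relationship discussed above.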
9.1.2 Cluster x Gender
The second characteristic by which we compare the clusters is gender (sex). Here we use the SAS
program:
proc freq data=Coffee.compare3;
table cluster*sex / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct Female Male Total
1 32 25 57
32.312 24.688
10.19 7.96 18.15
56.14 43.86
17.98 18.38
2 25 21 46
26.076 19.924
7.96 6.69 14.65
54.35 45.65
14.04 15.44
3 32 22 54
30.611 23.389
10.19 7.01 17.20
59.26 40.74
17.98 16.18
4 89 68 157
89 68
28.34 21.66 50.00
56.69 43.31
50.00 50.00
Total 178 136 314
56.69 43.31 100.00
Statistics for Table of CLUSTER by sex
Statistic DF Value Prob
Chi-Square 3 0.2550 0.9683
Likelihood Ratio Chi-Square 3 0.2553 0.9682
Mantel-Haenszel Chi-Square 1 0.0272 0.8689
Phi Coefficient 0.0285
Contingency Coefficient 0.0285
Cramer's V 0.0285
These results indicate that there is no statistically significant relationship between cluster
and gender (chi-square with 3 degrees of freedom = 0.2550, p = 0.9683). The p-value is 96% and
the chi-square value is very low; we clearly see that there is no association between cluster
and gender, and the features of all three clusters are acceptable to both genders.
9.1.3 Cluster x Age
Another variable to check is age, computed with the program:
proc freq data=Coffee.compare3;
table cluster*age / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct 14-19 20-24 25-29 30-39 =>40 Total
1 1 26 21 5 4 57
1.0892 31.223 18.153 3.9936 2.5414
0.32 8.28 6.69 1.59 1.27 18.15
1.75 45.61 36.84 8.77 7.02
16.67 15.12 21.00 22.73 28.57
2 1 30 10 3 2 46
0.879 25.197 14.65 3.2229 2.051
0.32 9.55 3.18 0.96 0.64 14.65
2.17 65.22 21.74 6.52 4.35
16.67 17.44 10.00 13.64 14.29
3 1 30 19 3 1 54
1.0318 29.58 17.197 3.7834 2.4076
0.32 9.55 6.05 0.96 0.32 17.20
1.85 55.56 35.19 5.56 1.85
16.67 17.44 19.00 13.64 7.14
4 3 86 50 11 7 157
3 86 50 11 7
0.96 27.39 15.92 3.50 2.23 50.00
1.91 54.78 31.85 7.01 4.46
50.00 50.00 50.00 50.00 50.00
Total 6 172 100 22 14 314
1.91 54.78 31.85 7.01 4.46 100.00
Statistics for Table of CLUSTER by age
Statistic DF Value Prob
Chi-Square 12 6.0238 0.9149
Likelihood Ratio Chi-Square 12 6.2856 0.9010
Mantel-Haenszel Chi-Square 1 0.5908 0.4421
Phi Coefficient 0.1385
Contingency Coefficient 0.1372
Cramer's V 0.0800
WARNING: 50% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
These results show that there is no statistically significant relationship between cluster and
age (chi-square with 12 degrees of freedom = 6.0238, p = 0.9149).
On the other hand, SAS warns that 50% of the cells have expected counts less than 5, so the
chi-square test may not be valid. For further calculations Fisher's exact test should be used.
Fisher's exact test is used when you want to conduct a chi-square test but one or more of your
cells has an expected frequency below five. The chi-square test assumes that each cell has an
expected frequency of five or more, whereas Fisher's exact test makes no such assumption and
can be used regardless of how small the expected frequencies are.
We could use the following program:
proc freq data=Coffee.compare3;
tables cluster*age / fisher;
run;
On the other hand, we clearly see that the majority of our respondents are 20-29 years old
(about 87%), so we focus on young people overall and further calculations are not necessary.
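To show what Fisher's exact test computes, the sketch below handles the simplest case, a 2×2 table, in Python. It is a simplification: the 4×5 Cluster x Age table would need a more general algorithm. The function name is hypothetical, and the example table is an illustrative collapse of the age table (Cluster 1 against Clusters 2 and 3, respondents aged 40 or more against all younger ones), which the authors did not analyse in this form.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Rows: Cluster 1, Clusters 2 and 3 together; columns: aged 40 or more, younger.
p_value = fisher_exact_2x2(4, 53, 3, 97)
print(p_value)
```

Because every possible table with the same margins is enumerated, the result does not depend on the expected counts being large.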
9.1.4 Cluster x Smoker
When opening a café it is important to know how the respondents relate to smoking, in order to
decide whether to prepare places for smokers or not to invest in them. First we determine the
frequencies of smokers in each cluster.
proc freq data=Coffee.compare3;
table cluster*smoker / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct No Yes Total
1 19 38 57
33.764 23.236
6.05 12.10 18.15
33.33 66.67
10.22 29.69
2 25 21 46
27.248 18.752
7.96 6.69 14.65
54.35 45.65
13.44 16.41
3 49 5 54
31.987 22.013
15.61 1.59 17.20
90.74 9.26
26.34 3.91
4 93 64 157
93 64
29.62 20.38 50.00
59.24 40.76
50.00 50.00
Total 186 128 314
59.24 40.76 100.00
Statistics for Table of CLUSTER by smoker
Statistic DF Value Prob
Chi-Square 3 38.4895
This time the results show that there is a statistically significant relationship between
cluster and smoking habits (chi-square with 3 degrees of freedom = 38.49, p =
9.1.5 Cluster x The Time of the Day
In this part we check whether there is any link between the time of the day when coffee is
drunk (m_1) and the cluster. In this way, for example, the opening hours of the café could be
optimized.
We use the program:
proc freq data=Coffee.compare3;
table cluster*m_1 / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct  After meals  Afternoon  Does not matter the time  Evening  Morning  Usually when I meet for this purpose  Total
1 7 1 23 0 18 8 57
6.172 2.9045 20.694 0.3631 20.331 6.535
2.23 0.32 7.32 0.00 5.73 2.55 18.15
12.28 1.75 40.35 0.00 31.58 14.04
20.59 6.25 20.18 0.00 16.07 22.22
2 7 2 14 1 19 3 46
4.9809 2.3439 16.701 0.293 16.408 5.2739
2.23 0.64 4.46 0.32 6.05 0.96 14.65
15.22 4.35 30.43 2.17 41.30 6.52
20.59 12.50 12.28 50.00 16.96 8.33
3 3 5 20 0 19 7 54
5.8471 2.7516 19.605 0.3439 19.261 6.1911
0.96 1.59 6.37 0.00 6.05 2.23 17.20
5.56 9.26 37.04 0.00 35.19 12.96
8.82 31.25 17.54 0.00 16.96 19.44
4 17 8 57 1 56 18 157
17 8 57 1 56 18
5.41 2.55 18.15 0.32 17.83 5.73 50.00
10.83 5.10 36.31 0.64 35.67 11.46
50.00 50.00 50.00 50.00 50.00 50.00
Total 34 16 114 2 112 36 314
10.83 5.10 36.31 0.64 35.67 11.46 100.00
Statistics for Table of CLUSTER by m_1
Statistic DF Value Prob
Chi-Square 15 10.6619 0.7762
Likelihood Ratio Chi-Square 15 11.1431 0.7424
Mantel-Haenszel Chi-Square 1 0.0290 0.8648
Phi Coefficient 0.1843
Contingency Coefficient 0.1812
Cramer's V 0.1064
WARNING: 33% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
These results show that there is no statistically significant relationship between cluster and
coffee drinking time (chi-square with 15 degrees of freedom = 10.66, p = 0.7762). However, SAS
warns that 33% of the cells have expected counts less than 5, so the chi-square test may not be
valid. For further calculations Fisher's exact test should be used.
We also notice that most respondents drink coffee in the morning (36%), say that the time does
not matter (36%), or drink it when they meet other people (11%). Just one respondent marked the
evening, so overall we can say that respondents drink coffee at all times of the day except the
evening. All the clusters show a similar trend, so further calculations are not necessary for
our conclusion.
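The SAS warning can be reproduced by counting the cells whose expected count is below 5. The Python sketch below is an illustration with a hypothetical function name; the counts are the observed frequencies of the Cluster x Time of the Day table above, with columns in the printed order.

```python
def share_of_small_expected(table, threshold=5.0):
    """Fraction of cells whose expected count under independence is below the threshold."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    return sum(1 for e in expected if e < threshold) / len(expected)


# Rows: clusters 1-4; columns: after meals, afternoon, does not matter,
# evening, morning, usually when meeting other people.
cluster_by_time = [
    [7, 1, 23, 0, 18, 8],
    [7, 2, 14, 1, 19, 3],
    [3, 5, 20, 0, 19, 7],
    [17, 8, 57, 1, 56, 18],
]
share = share_of_small_expected(cluster_by_time)
print(f"{share:.0%}")
```

Most of the small expected counts come from the Afternoon and Evening columns, which very few respondents chose.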
9.1.6 Cluster x The Type of Coffee
Another variable to analyze is the type of coffee (m_2) preferred by each cluster.
proc freq data=Coffee.compare3;
table cluster*m_2 / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct  Americano  Caffee Latte  Cappuccino  Espresso  Other, not traditional types  Total
1 1 13 8 32 3 57
2.5414 13.796 11.981 25.777 2.9045
0.32 4.14 2.55 10.19 0.96 18.15
1.75 22.81 14.04 56.14 5.26
7.14 17.11 12.12 22.54 18.75
2 2 6 10 25 3 46
2.051 11.134 9.6688 20.803 2.3439
0.64 1.91 3.18 7.96 0.96 14.65
4.35 13.04 21.74 54.35 6.52
14.29 7.89 15.15 17.61 18.75
3 4 19 15 14 2 54
2.4076 13.07 11.35 24.42 2.7516
1.27 6.05 4.78 4.46 0.64 17.20
7.41 35.19 27.78 25.93 3.70
28.57 25.00 22.73 9.86 12.50
4 7 38 33 71 8 157
7 38 33 71 8
2.23 12.10 10.51 22.61 2.55 50.00
4.46 24.20 21.02 45.22 5.10
50.00 50.00 50.00 50.00 50.00
Total 14 76 66 142 16 314
4.46 24.20 21.02 45.22 5.10 100.00
Statistics for Table of CLUSTER by m_2
Statistic DF Value Prob
Chi-Square 12 16.7882 0.1577
Likelihood Ratio Chi-Square 12 17.7738 0.1227
Mantel-Haenszel Chi-Square 1 2.2114 0.1370
Phi Coefficient 0.2312
Contingency Coefficient 0.2253
Cramer's V 0.1335
WARNING: 30% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
The results show that there is no statistically significant relationship between cluster and
type of coffee preferred (chi-square with 12 degrees of freedom = 16.79, p = 0.1577). Again, SAS
warns that 30% of the cells have expected counts less than 5, so the chi-square test may not be
valid. For further calculations Fisher's exact test should be used.
On the other hand, we see that Espresso is by far the most popular type of coffee (45%);
respondents also choose Caffee Latte (24%) and Cappuccino (21%), while the other choices are far
less common.
Cluster 1, looking just at the frequencies, is fonder of Espresso: 56% instead of the expected
45%. These respondents also like Caffee Latte (23%), but choose Cappuccino less often (14%
instead of 21%).
Cluster 2 respondents are also Espresso drinkers: 54% against the expected 45%. Their second
choice is Cappuccino (22%), but Caffee Latte is not so popular here (13% instead of 24%).
Cluster 3 are definitely Caffee Latte drinkers (35% instead of 24%). Their second choice is
Cappuccino (27% instead of 21%) and their third Espresso, whose frequency is much lower than
expected (26% instead of 45%).
9.1.7 Cluster x Times Per Day
In this part we evaluate each cluster against the number of times per day (m_3) respondents
drink coffee.
proc freq data=Coffee.compare3;
table cluster*m_3 / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct  1 or less  2  3  >3  Total
1 13 20 11 13 57
17.427 21.783 8.7134 9.0764
4.14 6.37 3.50 4.14 18.15
22.81 35.09 19.30 22.81
13.54 16.67 22.92 26.00
2 12 22 6 6 46
14.064 17.58 7.0318 7.3248
3.82 7.01 1.91 1.91 14.65
26.09 47.83 13.04 13.04
12.50 18.33 12.50 12.00
3 23 18 7 6 54
16.51 20.637 8.2548 8.5987
7.32 5.73 2.23 1.91 17.20
42.59 33.33 12.96 11.11
23.96 15.00 14.58 12.00
4 48 60 24 25 157
48 60 24 25
15.29 19.11 7.64 7.96 50.00
30.57 38.22 15.29 15.92
50.00 50.00 50.00 50.00
Total 96 120 48 50 314
30.57 38.22 15.29 15.92 100.00
Statistics for Table of CLUSTER by m_3
Statistic DF Value Prob
Chi-Square 9 9.2367 0.4157
Likelihood Ratio Chi-Square 9 8.8972 0.4468
Mantel-Haenszel Chi-Square 1 1.6380 0.2006
Phi Coefficient 0.1715
Contingency Coefficient 0.1690
Cramer's V 0.0990
These results support the null hypothesis H0 that there is no statistically significant
relationship between cluster and the number of times a day coffee is drunk (chi-square with 9
degrees of freedom = 9.2367, p = 0.4157). The p-value is 41%. Analyzing the clusters and answers
one by one, there are no very notable differences. The only remark is that Cluster 3
respondents choose drinking coffee once a day or less more often than expected: 43% instead of
the expected 31%. Knowing the characteristics of the clusters, we can say that these
respondents probably associate coffee with socializing and dessert rather than with an everyday
routine. Looking at the frequencies, respondents in Cluster 1 and Cluster 2 most often drink
coffee twice a day. In Cluster 3 the twice-a-day choice is also frequent.
9.1.8 Cluster x Way of Drinking Coffee
This time we compare the cluster with the way of drinking coffee (m_4), using the program:
proc freq data=Coffee.compare3;
table cluster*m_4 / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct  Bar  Take Away  Sitting in Cafe  Total
1 11 11 35 57
13.433 11.618 31.949
3.50 3.50 11.15 18.15
19.30 19.30 61.40
14.86 17.19 19.89
2 18 9 19 46
10.841 9.3758 25.783
5.73 2.87 6.05 14.65
39.13 19.57 41.30
24.32 14.06 10.80
3 8 12 34 54
12.726 11.006 30.268
2.55 3.82 10.83 17.20
14.81 22.22 62.96
10.81 18.75 19.32
4 37 32 88 157
37 32 88
11.78 10.19 28.03 50.00
23.57 20.38 56.05
50.00 50.00 50.00
Total 74 64 176 314
23.57 20.38 56.05 100.00
Statistics for Table of CLUSTER by m_4
Statistic DF Value Prob
Chi-Square 6 9.5977 0.1426
Likelihood Ratio Chi-Square 6 9.2570 0.1596
Mantel-Haenszel Chi-Square 1 0.0296 0.8633
Phi Coefficient 0.1748
Contingency Coefficient 0.1722
Cramer's V 0.1236
The results support the null hypothesis H0 that there is no statistically significant
relationship between cluster and the way coffee is drunk (chi-square with 6 degrees of freedom
= 9.5977, p = 0.1426). The evidence for independence is not so strong, with a p-value of 14%.
Analyzing the frequencies one by one, we notice some values different from expected in Cluster
2, where respondents choose more often than usual to drink coffee quickly at the bar (39%
instead of the expected 24%), and less often than usual to take a cup of coffee without hurry in
a café (41% instead of 56%).
Overall, looking at the whole sample, sitting in a café and taking one's time is the most
popular way to drink coffee (56%); take-away (20%) and fast coffee at the bar (24%) have more or
less the same popularity.
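The probability reported next to a chi-square value is the upper tail of the chi-square distribution. For an even number of degrees of freedom this tail has a simple closed form, which the Python sketch below uses. It is a hand check with a hypothetical function name, valid for even degrees of freedom only, and not the routine SAS uses; the inputs are the Chi-Square line of the Cluster x Way of Drinking table (9.5977 with 6 degrees of freedom).

```python
import math


def chi_square_p_value_even_df(stat, df):
    """Upper-tail probability of a chi-square statistic; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even number of degrees of freedom")
    half = stat / 2.0
    term = 1.0
    total = 1.0
    # Sum of (stat/2)^i / i! for i = 0 .. df/2 - 1.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


p_way = chi_square_p_value_even_df(9.5977, 6)
print(round(p_way, 4))
```

The same function applied to the Cluster x Country statistic (9.6280, 6 degrees of freedom) gives a value close to the 0.1412 reported there.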
9.2 Country X Variable
In this part we compare the variable Country with the other qualitative variables related to
personal coffee drinking habits (questions m_1-m_4). As mentioned before, the variable country
is important to analyze for marketing decisions, because of cultural differences.
9.2.1 Country x The Time of the Day
First we compare the variable country with the time of the day when coffee is drunk (m_1), using
the program:
proc freq data=Coffee.compare3;
table country*m_1 / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct  After meals  Afternoon  Does not matter the time  Evening  Morning  Usually when I meet other people for this purpose  Total
Italy 20 4 34 2 42 0 102
11.045 5.1975 37.032 0.6497 36.382 11.694
6.37 1.27 10.83 0.64 13.38 0.00 32.48
19.61 3.92 33.33 1.96 41.18 0.00
58.82 25.00 29.82 100.00 37.50 0.00
Lithuanian 10 12 48 0 40 16 126
13.643 6.4204 45.745 0.8025 44.943 14.446
3.18 3.82 15.29 0.00 12.74 5.10 40.13
7.94 9.52 38.10 0.00 31.75 12.70
29.41 75.00 42.11 0.00 35.71 44.44
Palestine 4 0 32 0 30 20 86
9.3121 4.3822 31.223 0.5478 30.675 9.8599
1.27 0.00 10.19 0.00 9.55 6.37 27.39
4.65 0.00 37.21 0.00 34.88 23.26
11.76 0.00 28.07 0.00 26.79 55.56
Total 34 16 114 2 112 36 314
10.83 5.10 36.31 0.64 35.67 11.46 100.00
Statistics for Table of country by m_1
Statistic DF Value Prob
Chi-Square 10 49.0229
9.2.2 Country x Type of Coffee
Another variable to compare by country is the type of coffee (m_2).
proc freq data=Coffee.compare3;
table country*m_2 / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct  Americano  Caffee Latte  Cappuccino  Espresso  Other, not traditional types  Total
Italy 6 6 8 82 0 102
4.5478 24.688 21.439 46.127 5.1975
1.91 1.91 2.55 26.11 0.00 32.48
5.88 5.88 7.84 80.39 0.00
42.86 7.89 12.12 57.75 0.00
Lithuanian 6 64 22 30 4 126
5.6178 30.497 26.484 56.981 6.4204
1.91 20.38 7.01 9.55 1.27 40.13
4.76 50.79 17.46 23.81 3.17
42.86 84.21 33.33 21.13 25.00
Palestine 2 6 36 30 12 86
3.8344 20.815 18.076 38.892 4.3822
0.64 1.91 11.46 9.55 3.82 27.39
2.33 6.98 41.86 34.88 13.95
14.29 7.89 54.55 21.13 75.00
Total 14 76 66 142 16 314
4.46 24.20 21.02 45.22 5.10 100.00
Statistics for Table of country by m_2
Statistic DF Value Prob
Chi-Square 8 151.8787
There is definitely a statistically significant relationship between country and type of coffee
(chi-square with 8 degrees of freedom = 151.87, p =
Palestine 30 24 18 14 86
26.293 32.866 13.146 13.694
9.55 7.64 5.73 4.46 27.39
34.88 27.91 20.93 16.28
31.25 20.00 37.50 28.00
Total 96 120 48 50 314
30.57 38.22 15.29 15.92 100.00
Statistics for Table of country by m_3
Statistic DF Value Prob
Chi-Square 6 29.5616
9.2.4 Country X The Way of Drinking
Finally we compare country with the way of drinking coffee (m_4).
proc freq data=Coffee.compare3;
table country*m_4 / all expected;
run;
Frequency
Expected
Percent
Row Pct
Col Pct  Bar  Take Away  Sitting in Cafe  Total
Italy 66 12 24 102
24.038 20.79 57.172
21.02 3.82 7.64 32.48
64.71 11.76 23.53
89.19 18.75 13.64
Lithuanian 4 36 86 126
29.694 25.682 70.624
1.27 11.46 27.39 40.13
3.17 28.57 68.25
5.41 56.25 48.86
Palestine 4 16 66 86
20.268 17.529 48.204
1.27 5.10 21.02 27.39
4.65 18.60 76.74
5.41 25.00 37.50
Total 74 64 176 314
23.57 20.38 56.05 100.00
Statistics for Table of country by m_4
Statistic DF Value Prob
Chi-Square 4 145.6996
of 56%); moreover, take-away is not so common in this culture (12% instead of 20%).
Lithuanians, on the contrary, like to take their time over a cup of coffee more than expected
(68% instead of 56%). Failing that, to save time, Lithuanians take their cup of coffee away
(29%). They do not have the habit of drinking coffee in a hurry at the bar (3% instead of the
expected 24%).
Palestinians are more similar to Lithuanians than to Italians. They prefer taking their time
over a cup of coffee (76% instead of the expected 56%), and 19% of Palestinians like take-away
coffee. Only a few of them (5% instead of the expected 24%) opt for a fast coffee at the bar.
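Reading the table cell by cell, as done above, amounts to asking which cells contribute most to the chi-square value. The Python sketch below ranks the cells of the Country x Way of Drinking table by their contribution (O − E)² / E. It is an illustrative check with a hypothetical function name. The counts are taken as printed in the table; since they come from Coffee.compare3, in which each respondent appears twice, they total 314.

```python
def cell_contributions(table, row_labels, col_labels):
    """Return ((O - E)^2 / E, row label, column label) for every cell, largest first."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = []
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            cells.append(((observed - expected) ** 2 / expected, row_labels[i], col_labels[j]))
    return sorted(cells, reverse=True)


country_by_way = [
    [66, 12, 24],   # Italy
    [4, 36, 86],    # Lithuania
    [4, 16, 66],    # Palestine
]
ranked = cell_contributions(
    country_by_way,
    ["Italy", "Lithuania", "Palestine"],
    ["Bar", "Take Away", "Sitting in Cafe"],
)
for contribution, country, way in ranked[:3]:
    print(country, way, round(contribution, 2))
```

The contributions sum to the chi-square value of the table (145.6996), and the largest one belongs to Italians drinking coffee at the bar.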
9.2.5 The Most Significant Features of The Countries
After comparing the three countries on different variables, we can recognize some clear
features of, and differences between, Italy, Lithuania and Palestine.
9.2.5.1 Italy
Italians are people with specific coffee drinking traditions. They usually drink coffee twice
a day: in the morning and after meals, or at any other time of the day. Italians prefer
Espresso, most likely taken quickly at the bar. These are the most significant features of the
Italian respondents.
9.2.5.2 Lithuania
Lithuanians drink coffee once or twice a day, usually in the morning and then at any other time
of the day, often with the purpose of socializing. Although Espresso is the most popular coffee
overall, Lithuanians are real Caffee Latte lovers. Moreover, they enjoy sitting in the café and
taking their time.
9.2.5.3 Palestine
Palestinians are more similar to Lithuanians than to Italians. Palestinians usually drink
coffee in the morning, or at no fixed time, for the purpose of meeting people and socializing.
Palestinians drink coffee either rarely, once a day or less, or three or more times a day.
Palestinians have a strong preference for Cappuccino; moreover, as everywhere, Espresso is also
important. More than the other respondents, Palestinians like different, non-traditional
tastes of coffee. Like Lithuanians, they have a strong preference for taking their time over
coffee in a café.
10 Strategic Decisions
Finally we come to the conclusions, where we define the different groups with their particular
characteristics and habits, which tell us what kind of café would be popular with each group.
First we determine the most significant features of each country. Then we use these features to
make the strategic marketing decision of what kind of café to open in each area.
10.1 Cluster 1- Sophisticated Coffee and Cigarettes
As we saw above, Cluster 1 is a group of people who enjoy smoking, a good interior and
atmosphere, and socializing while drinking coffee. Respondents with these characteristics are
spread across all three countries, without any significant differences from the expected
frequencies. This cluster has one significant feature: most of its respondents are smokers.
Moreover, as is common for the whole sample, people in this cluster mostly choose to take their
time when drinking coffee.
For this group of respondents fashionable cafés would be opened. The greatest attention should
be paid to creating an interior and a cozy atmosphere that invite customers to stay longer.
Even if it is forbidden to smoke inside, a smoking area should be available, with heaters for
winter (especially in Lithuania). The cafés should be situated in locations that are convenient
for meetings. Since Italians tend to drink coffee quickly at the bar, the café in Italy should
have fewer seats, and more attention should be paid to an attractive bar for taking a fast
coffee on the way, or with a cigarette outside. In the other countries a bar is not necessary;
more attention must be paid to providing enough places to sit. The menu in a sophisticated
place should include the traditional types of coffee.
10.2 Cluster 2- Fast Coffee
Cluster 2 describes people whose preference is good-quality strong coffee, mostly Espresso, as
well as Ice Coffee (we assume in the summer season). These people do not like to spend a lot of
time on the process; they prefer fast coffee. Moreover, Cluster 2 has a higher frequency of
Italians than expected.
The café to satisfy the needs of the group described by Cluster 2 should be a simple coffee bar
in convenient locations: next to offices, city center shopping places, universities and lunch
restaurants. Good-quality strong coffee and, in summer, a choice of ice coffee are essential
features. No investment should be made in extended menus with non-traditional tastes of coffee.
Knowing the features of coffee drinking habits in Italy and the fact that the cluster has a
significant number of Italians, these coffee bars should be opened in Italy first. Knowing that
there is strong competition from places with a similar concept in this country, we would
compete on good-quality coffee, convenient locations and fast service, or the possibility of
self-service to make the process less time-consuming.
10.3 Cluster 3- Sweet Break or Take-Away
Cluster 3 describes people who like meeting other people and, while socializing, having a cup
of coffee in different flavors, or with a dessert on the side. Their preferred type of coffee is
Caffe Latte and probably its variations. These respondents also choose Take-Away coffee. This
way of associating coffee with socializing and dessert is more common among Lithuanians.
Here a Starbucks-style coffee place should be opened. The café would offer an extended menu
of different flavors. Most of the attention should be paid to Caffe Latte with different syrups.
An attractive dessert menu should be available. The place should be cozy enough to spend some
time there, but simple, to keep prices low.
Given the features of the three countries and the fact that this cluster is more common among
Lithuanians, we would open cafés of this concept in Lithuania first. Palestine should not be
forgotten either, as respondents there express a preference for different types of coffee,
drinking it many times a day and taking their time over it. For the moment we should not invest
in opening such a place in Italy, as people there have slightly different habits.
11 Appendix
Appendix 1. Questionnaire.
Coffee drinking habits in Palestine, Lithuania and Italy
We are making a research about the preferences of cafe and coffee drinking habits in three different
cultures, with the purpose to find the best concept of cafe in each location. So the questionnaire is
oriented to your coffee drinking habits in cafes, not at home (if, for example you drink coffee every
morning at home, and later in a cafe with other people, please relate the answers more with the second).
Please keep that in mind answering the questions. If you do not drink coffee, do not fill the
questionnaire. We are kindly asking to fill the questionnaire just if you are originally from Lithuania,
Italy or Palestine. For each multiple choice question choose just one best answer. For scale type
questions, evaluate the argument or answer the question, when 1 is the most negative answer, 10 is the
most positive. The survey is absolutely anonymous.
1. Choose one favourite type of coffee, which you usually drink, from the list below: *
   Espresso
   Americano
   Cappuccino
   Caffee Latte
   Other, not traditional types
2. When do you usually drink coffee? *
   Morning
   Afternoon
   Evening
   After meals
   Does not matter the time
   Usually when I meet other people for this purpose
3. How many times a day do you usually drink coffee? *
   1 or less
   2
   3
   Working person
   Unemployed
   Other
18. Are you smoker? *
   No
   Yes