bi canetto

Upload: aleksandra-vinokurova

Post on 05-Apr-2018

  • 8/2/2019 BI Canetto

    1/56

    CLAMDA - INTERNATIONAL MANAGEMENT

    FACULTY OF ECONOMICS

    Coffee Drinking Habits in Lithuania, Palestine and Italy

    Business Intelligence Written Assignment

    Kotryna Garsvaite

    Mahran Sharqawi

    Marcello Canetto

    Professor Furio Camillo

    CONTENT

    Introduction
    1 The Research
    2 Importing Data
    3 Simple Statistics
    4 Principal Component Analysis
    5 Size Effects Removal
        5.1 Principal Component Analysis after Size Effect Removal
    6 The Cluster Analysis
    7 Ward's Method
        7.1 Dendrogram - Graphical Representation
    8 T-TEST Procedure
        8.1 Preparation of the dataset
        8.2 T-Test for Respondents
        8.3 Cluster 1
        8.4 Cluster 2
        8.5 Cluster 3
    9 Proc Freq Procedure (Chi Square Test)
        9.1 Cluster x Variable
        9.2 Country x Variable
    10 Strategic Decisions
        10.1 Cluster 1 - Sophisticated Coffee and Cigarettes
        10.2 Cluster 2 - Fast Coffee
        10.3 Cluster 3 - Sweet Break or Take-Away
    11 Appendix

    Introduction

    The energizing effect of the coffee plant is thought to have been discovered in the northeast region of Ethiopia, and the cultivation of coffee first expanded in the Arab world. The earliest credible evidence of coffee drinking appears in the middle of the 15th century, in the Sufi monasteries of Yemen in southern Arabia. From the Muslim world, coffee spread to Italy, then to the rest of Europe.

    Coffee has been a popular drink for many centuries. Searching through the pages of history for the roots of this remarkable drink leads to many stories, some of them legends. From South American countries such as Brazil, through the northeast region of Ethiopia, to the Arabian Peninsula and on to Europe, people have used coffee for its stimulating effect, which is due to its caffeine content.

    Because of the popularity and appeal of coffee, and the possibility of distributing the questionnaire in different countries, we chose Coffee Drinking Habits in Lithuania, Palestine and Italy as the subject of our research. Our goal is to find the best concept of cafes for different groups of people in three different countries.

    1 The Research

    In our research we examined the habits of drinking coffee outside the home in three countries: Italy, Lithuania and Palestine. To this end we designed a questionnaire aimed at people in these three countries, which lie in different parts of the map and have different climates and, of course, different cultures. Our purpose is to find the differences in coffee drinking habits between these three countries and, in addition, to find similarities between specific groups within them.

    We are also aware of the wide range of respondents and of the effect other factors may have on their answers. Coffee is perceived differently in these countries; still, worldwide chains such as Starbucks may have a similar effect on consumers in different places.

    Through our questionnaire we tried to collect information about coffee drinking habits, such as the type of coffee preferred, the times at which people prefer to drink coffee, and other factors that matter to people who drink their coffee outside the home.

    The questionnaire was worded as clearly as possible so that everyone could understand the questions and answers without error. Another feature is its simplicity: we structured the questions as simply as possible so that respondents could answer them in the minimum time.

    The result was a questionnaire of 18 questions (Appendix 1).

    First we placed four qualitative questions: favorite type of coffee, frequency of drinking coffee, how many times a day, and the modality of drinking coffee outside the home.

    Then we identified 9 factors that we consider important for understanding the reasons behind the respondents' decisions. We formulated 9 questions and asked respondents to rate each factor on a scale from 1 to 10, where 1 is the most negative (not important) evaluation and 10 the most positive (important).

    These 9 factors are:

    1. Interior and atmosphere of the place.
    2. Socializing with people.
    3. Effect of caffeine.
    4. Traditional tastes of coffee.
    5. Importance of the price.
    6. Smoking.
    7. "Take-away culture".
    8. Ice coffee.
    9. Dessert/croissant.

    As our target audience was spread over three different countries and was hard to reach physically in order to hand out the questionnaire, we reached it through the internet. We published the questionnaire online for four days.

    We placed the general questions at the end of the questionnaire in order to identify our respondents.

    In the end our sample consisted of 157 usable observations, around 40-60 from each country. The most active respondents were Lithuanians, the least active Palestinians.

    The data were originally recorded in Microsoft Excel and later imported into SAS software.

    2 Importing Data

    The first step was to import the data from Excel into SAS using the Import Wizard.

    Then we renamed and labeled the questions as follows:

    ID id

    1. Question m_1 (type of coffee): label type
    2. Question m_2 (time of drinking): label time
    3. Question m_3 (times drinking): label times a day
    4. Question m_4 (preferred drinking way): label way to drink

    Next came 9 sub-questions asking how important different factors are to the respondents when choosing.

    5. Question s_1 (interior and atmosphere): label interior
    6. Question s_2 (socializing with people): label socializing
    7. Question s_3 (effect of caffeine): label caffeine
    8. Question s_4 (traditional tastes): label tastes
    9. Question s_5 (importance of price): label price
    10. Question s_6 (relating smoking): label smoking
    11. Question s_7 (take-away culture): label take away
    12. Question s_8 (ice coffee): label ice coffee
    13. Question s_9 (dessert/croissant): label dessert
    14. Question country: label country
    15. Question gender: label gender
    16. Question age: label age
    17. Question occupation: label occupation
    18. Question smoker: label smoker

    To do so, we used the following statements in SAS:

    /* Attach a descriptive label to each questionnaire variable */
    data Coffee.Coffee;
      set Coffee.Coffee;
      label id='id'
            m_1='type'
            m_2='time'
            m_3='times a day'
            m_4='way to drink'
            s_1='interior'
            s_2='socializing'
            s_3='caffeine'
            s_4='tastes'
            s_5='price'
            s_6='smoking'
            s_7='take away'
            s_8='ice coffee'
            s_9='dessert'
            country='country'
            gender='gender'
            age='age'
            occupation='occupation'
            smoker='smoker';
    run;

    Our data was ready for the analysis of the values in the respective tables.

    3 Simple Statistics

    Once our data had been sorted and renamed, we started the first data analysis.

    The first procedure we ran was PROC MEANS. To do this we gave SAS the command:

    proc means data=Coffee.Coffee n mean stddev stderr median cv;
      var s_1-s_9;
    run;

    With this procedure we obtain the number of observations (n), the average (mean), the standard deviation (stddev), the standard error (stderr), the median (median) and the coefficient of variation (cv).

    The MEANS Procedure

    Variable  Label          N       Mean    Std Dev  Std Error     Median  Coeff of Variation
    s_1       interior     157  6.8343949  2.4543189  0.1958760  8.0000000          35.9112829
    s_2       socializing  157  6.3312102  2.4215606  0.1932616  7.0000000          38.2479888
    s_3       caffeine     157  6.5796178  2.6339193  0.2102096  7.0000000          40.0314930
    s_4       tastes       157  4.1528662  3.0237640  0.2413226  3.0000000          72.8114953
    s_5       price        157  5.3630573  2.6485431  0.2113767  5.0000000          49.3849478
    s_6       smoking      143  4.7482517  3.8409736  0.3211983  3.0000000          80.8923740
    s_7       take away    157  6.1401274  2.8699562  0.2290474  6.0000000          46.7409882
    s_8       ice coffee   157  5.5732484  3.1136668  0.2484977  6.0000000          55.8680781
    s_9       dessert      157  5.3503185  2.7939179  0.2229789  5.0000000          52.2196568

    In this table we looked for the highest means combined with the lowest standard deviations, which give a clear view of the important variables in our research. For example, the high value of the Interior variable means that respondents give this factor high importance when choosing a place to drink coffee. Respondents also gave high importance to Socializing and to the Take-Away option. This tells us that, for those choosing where to drink coffee, three of the important factors are connected not to the coffee itself but to the habit of drinking coffee. Another important fact is that Caffeine has one of the four highest mean values, which tells us that some respondents associate coffee with its primary feature, caffeine.
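    The statistics reported by PROC MEANS can be reproduced by hand. The following is a minimal Python sketch, not the SAS code used in the study: the function name `describe` and the sample ratings are hypothetical. It assumes the sample standard deviation (divisor n - 1) and skips missing answers, so a variable with unanswered questions ends up with a smaller N.

```python
import statistics
from math import sqrt


def describe(values):
    """Return n, mean, stddev, stderr, median and cv for one rating variable.

    Missing answers are passed as None and are left out of every statistic.
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.mean(present)
    stddev = statistics.stdev(present)  # sample standard deviation (n - 1)
    return {
        "n": n,
        "mean": mean,
        "stddev": stddev,
        "stderr": stddev / sqrt(n),      # standard error of the mean
        "median": statistics.median(present),
        "cv": 100 * stddev / mean,       # coefficient of variation, in percent
    }


# Hypothetical ratings on the 1-10 scale, with one missing answer.
ratings = [8, 7, None, 9, 5, 8]
for name, value in describe(ratings).items():
    print(name, value)
```

    The same relations can be checked against the table: for s_1, 100 * 2.4543189 / 6.8343949 gives the reported coefficient of variation of about 35.91.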

    4 Principal Component Analysis

    In this part we look for possible relationships between the variables, that is, between the answers given to the different questions. This requires a multivariate analysis of the responses, which is done through principal component analysis.

    In SAS we use the program:

    proc princomp data=Coffee.Coffee;
      var s_1-s_9;
    run;

    This gives us 3 useful results to analyze:

    1) Correlation coefficients

    Correlation Matrix

                         s_1     s_2     s_3     s_4     s_5     s_6     s_7     s_8     s_9
    s_1 interior      1.0000  0.5319  -.1563  0.2340  0.1566  -.0505  0.3069  0.0336  0.1214
    s_2 socializing   0.5319  1.0000  -.0171  0.3303  0.1707  -.0478  0.3081  0.0174  0.1539
    s_3 caffeine      -.1563  -.0171  1.0000  0.0022  0.1269  -.0377  -.0592  0.0135  0.1890
    s_4 tastes        0.2340  0.3303  0.0022  1.0000  0.1767  -.1145  0.3631  0.1942  0.2501
    s_5 price         0.1566  0.1707  0.1269  0.1767  1.0000  0.1137  -.0102  0.0520  0.0889
    s_6 smoking       -.0505  -.0478  -.0377  -.1145  0.1137  1.0000  -.0909  -.0998  -.1030
    s_7 take away     0.3069  0.3081  -.0592  0.3631  -.0102  -.0909  1.0000  0.3366  0.1401
    s_8 ice coffee    0.0336  0.0174  0.0135  0.1942  0.0520  -.0998  0.3366  1.0000  0.2037
    s_9 dessert       0.1214  0.1539  0.1890  0.2501  0.0889  -.1030  0.1401  0.2037  1.0000

    The first observation is that the correlation coefficients are a mix of positive and negative values: the majority are positive, and only 11 of the 36 distinct pairs are negative. In the original table the highest values were highlighted in yellow and the lowest in light blue.

    Starting with the positive values, the highest is 0.5319, the correlation between Socializing and Interior/Atmosphere. We can conclude that respondents who gave importance to Socializing with people while drinking coffee also gave great importance to the Interior of the cafe, and vice versa.

    Although only one correlation is above 0.5, we still consider values above 0.2 as high, since we are surveying a wide range of respondents. In second place we find the correlation between Tastes and Take-Away, with a value of 0.3631. We assume that respondents who like to take their coffee away care about the different tastes of coffee.

    In third place is the correlation between Ice Coffee and Take-Away, with a value of 0.3366. From this it could be concluded that ice coffee lovers take it away. Moreover, Take-Away people, besides the different tastes mentioned before, also like Ice Coffee.

    Another positive correlation is between Tastes and Dessert (0.25): respondents who like different, probably sweet, tastes of coffee do not refuse dessert either.

    The last high value we discuss in this table is the correlation between Tastes and Socializing (0.3303). It suggests that people like spending time with others in a coffee place that can offer various tastes. Furthermore, Tastes has another positive correlation, of 0.234, with Interior.

    The negative values, all small in magnitude, point to features that do not go together (Caffeine and Socializing; Smoking with Tastes, Take-Away, Ice Coffee and Dessert).

    These observations give us a first view of the trends in our research, which will be quantified more precisely in later calculations.
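    Each cell of the correlation matrix is a Pearson correlation coefficient between two rating variables. As an illustration only, here is a short Python sketch of that calculation; the function name `pearson` and the two lists of ratings are hypothetical and are not taken from the survey data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of the deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (spread_x * spread_y)


# Hypothetical ratings of five respondents for two factors.
interior = [8, 6, 9, 3, 7]
socializing = [7, 5, 9, 4, 6]
print(round(pearson(interior, socializing), 4))
```

    A value close to +1 means that respondents who rate one factor highly tend to rate the other highly too; a value close to 0 means the two ratings are almost unrelated.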

    The values are not all positive, which reduces the likelihood of the distortion known as the size effect. Still, the data must be corrected to check whether this effect exists and how it might influence the results obtained. To do this we will use a procedure shown later.

    Now we continue with the second part of the PRINCOMP analysis.

    2) Correlation matrix eigenvalues

    We noted the presence of positive correlations between the variables, but also of negative ones; we nevertheless decided to try to eliminate the size effect. The PRINCOMP procedure in SAS generates new vectors defining a new system of independent, uncorrelated dimensions. Each principal component is a linear combination of the original variables, with coefficients given by an eigenvector of the correlation matrix.

    Eigenvalues of the Correlation Matrix

         Eigenvalue  Difference  Proportion  Cumulative
    1    2.30618151  0.98494347      0.2562      0.2562
    2    1.32123803  0.09821232      0.1468      0.4030
    3    1.22302572  0.23309129      0.1359      0.5389
    4    0.98993443  0.19748232      0.1100      0.6489
    5    0.79245210  0.05262728      0.0881      0.7370
    6    0.73982482  0.03803349      0.0822      0.8192
    7    0.70179132  0.20455790      0.0780      0.8972
    8    0.49723342  0.06891478      0.0552      0.9524
    9    0.42831865                  0.0476      1.0000

    The first column shows the eigenvalue of each principal component, that is, the amount of variance it explains.

    We consider the eigenvalues to determine the importance of the principal components. The first 3 eigenvalues are greater than one and are therefore the most significant. However, keeping only the first three would explain just under 54% of the variance.

    The other components are less important, but each still explains roughly 5% or more of the variance. Our target is not specific, so in order not to lose information we consider all 9 principal components, which together explain the total variance.
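    The Proportion and Cumulative columns follow directly from the eigenvalues: for a correlation matrix of 9 variables the eigenvalues sum to 9, so each proportion is the eigenvalue divided by that total. The Python sketch below is only an illustration of this arithmetic, using the eigenvalues from the table; the variable names are ours.

```python
# Eigenvalues of the correlation matrix, copied from the table.
eigenvalues = [
    2.30618151, 1.32123803, 1.22302572, 0.98993443, 0.79245210,
    0.73982482, 0.70179132, 0.49723342, 0.42831865,
]

total = sum(eigenvalues)  # equals the number of variables (9)
proportions = [value / total for value in eigenvalues]

# Running sum of the proportions gives the cumulative variance explained.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

# Components with an eigenvalue above one (the "greater than one" rule).
above_one = [i + 1 for i, value in enumerate(eigenvalues) if value > 1]

for i, (share, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Prin{i}: proportion={share:.4f} cumulative={cum:.4f}")
print("eigenvalue > 1:", above_one)
```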

    3) Eigenvectors

                          Prin1     Prin2     Prin3     Prin4
    s_1 interior       0.434038  -.417329  0.046529  -.169922
    s_2 socializing    0.461075  -.313510  0.160018  -.240113
    s_3 caffeine       -.010706  0.517591  0.462081  -.245532
    s_4 tastes         0.447888  0.103031  0.018111  0.051922
    s_5 price          0.187179  -.016211  0.637148  0.248980
    s_6 smoking        -.129733  -.273119  0.374877  0.665444
    s_7 take away      0.441400  0.033958  -.319502  0.240765
    s_8 ice coffee     0.260091  0.421480  -.287120  0.516211
    s_9 dessert        0.289749  0.442015  0.165447  -.14574

              Prin5     Prin6     Prin7     Prin8     Prin9
    s_1    0.079371  -.036478  0.389923  -.050380  0.666483
    s_2    0.143574  0.169851  0.091209  0.437747  -.597040
    s_3    0.180006  0.609017  0.062985  0.064716  0.216151
    s_4    -.151813  -.070963  -.787577  0.244862  0.278280
    s_5    -.606727  -.119484  0.125885  -.279837  -.142138
    s_6    0.546455  0.030417  -.112881  0.092286  0.066560
    s_7    0.141779  0.417523  -.065994  -.640667  -.186355
    s_8    -.155812  0.007970  0.418987  0.450530  0.054111
    s_9    0.454455  -.635840  0.083081  -.203562  -.113544

    In the first column, Prin1, we count 7 variables with positive coefficients and 2 with negative coefficients. This shows that there is no strong size effect. However, before further considerations, we will remove the size effect to improve the result of our analysis.

    5 Size Effects Removal

    Size effects removal is motivated by the fact that the values assigned to the factors of the questionnaire depend on the average level of one person's judgments. These levels can vary greatly: we find very low values throughout for people with a more pessimistic view, and higher values for those who are accustomed to giving high scores to the different parameters (optimistic).

    Size effects are removed through a standardization procedure in SAS. We start by creating 9 new variables (n_1-n_9), which represent the new values of the 9 scale questions. These values are centered on the average value of each individual. SAS calculates the maximum, the minimum and the average value for each individual, and then rescales the answers to a range between -1 and +1, with the individual's average represented by 0.

    We gave SAS the following command:

    data Coffee.Coffee_1;

    set Coffee.Coffee;

    if _n_
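    The per-respondent rescaling described in this section can be sketched as follows. This is a Python illustration, not the authors' SAS step, and the exact formula is an assumption: answers above the respondent's own average are divided by the distance from the average to the respondent's maximum, answers below it by the distance from the average to the minimum. The function name `remove_size_effect` is hypothetical.

```python
def remove_size_effect(scores):
    """Rescale one respondent's ratings to [-1, +1] around their own mean.

    Assumed rule: the respondent's maximum maps to +1, the minimum to -1
    and the personal average to 0.
    """
    average = sum(scores) / len(scores)
    highest = max(scores)
    lowest = min(scores)
    rescaled = []
    for score in scores:
        if score > average:
            rescaled.append((score - average) / (highest - average))
        elif score < average:
            rescaled.append((score - average) / (average - lowest))
        else:
            # Also covers a respondent who gave the same rating everywhere.
            rescaled.append(0.0)
    return rescaled


# An "optimistic" and a "pessimistic" respondent with the same preferences.
print(remove_size_effect([8, 9, 10]))
print(remove_size_effect([1, 2, 3]))
```

    Both respondents end up with the same rescaled profile, which is the point of the correction: the general level of the ratings no longer matters, only the relative preferences do.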

    Now, using our new dataset, we repeat the initial procedures to check the actual difference between the original dataset and the corrected one.

    As before, we use PROC MEANS, so in our SAS program we write:

    proc means data=Coffee.Coffee_1;
      var n_1-n_9;
    run;

    The result is:

    The MEANS Procedure

    Variable    N        Mean    Std Dev     Minimum    Maximum
    n_1       157   0.3196513  0.5870416  -1.0000000  1.0000000
    n_2       157   0.1830072  0.5670946  -1.0000000  1.0000000
    n_3       157   0.2677898  0.6738488  -1.0000000  1.0000000
    n_4       157  -0.3539103  0.6861465  -1.0000000  1.0000000
    n_5       157  -0.0736206  0.6409227  -1.0000000  1.0000000
    n_6       157  -0.1591141  0.8576755  -1.0000000  1.0000000
    n_7       157   0.1395236  0.6865145  -1.0000000  1.0000000
    n_8       157   0.0133477  0.7477820  -1.0000000  1.0000000
    n_9       157  -0.0531759  0.6824755  -1.0000000  1.0000000

    First we notice that after the standardization all values lie between -1 and +1.

    The standardization does not show surprising effects. In fact the three factors with the highest values did not change: Interior, Socializing and Caffeine. However, the factor Take-Away, although still in fourth place, loses its decisive role and becomes less important in the analysis.

    Looking at the negative means (highlighted in blue in the original table), we find, in order of magnitude: Tastes, Smoking, Price and Dessert. The trend is again similar to the one seen before size effect removal.

    5.1 Principal Component Analysis after Size Effect Removal

    After size effects removal we can repeat the principal component procedure using the new, more precise dataset.

    In the SAS program we write:

    proc princomp data=Coffee.Coffee_1 out=Coffee.cluster;
      var n:;   /* all variables whose names begin with n (here n_1-n_9) */
    run;

    Correlation Matrix

              n_1     n_2     n_3     n_4     n_5     n_6     n_7     n_8     n_9
    n_1    1.0000  0.3349  -.1629  -.0016  0.0090  -.1434  0.1069  -.1875  -.0245
    n_2    0.3349  1.0000  -.1091  0.1039  0.0056  -.2060  0.0633  -.2269  -.0505
    n_3    -.1629  -.1091  1.0000  -.2004  0.0376  -.1189  -.2219  -.0620  0.0918
    n_4    -.0016  0.1039  -.2004  1.0000  -.0055  -.2079  0.1704  0.0148  0.1436
    n_5    0.0090  0.0056  0.0376  -.0055  1.0000  0.0012  -.1916  -.1977  -.0554
    n_6    -.1434  -.2060  -.1189  -.2079  0.0012  1.0000  -.2312  -.1898  -.1804
    n_7    0.1069  0.0633  -.2219  0.1704  -.1916  -.2312  1.0000  0.1870  -.0672
    n_8    -.1875  -.2269  -.0620  0.0148  -.1977  -.1898  0.1870  1.0000  0.0374
    n_9    -.0245  -.0505  0.0918  0.1436  -.0554  -.1804  -.0672  0.0374  1.0000

    This correlation matrix shows some differences from the previous one, computed without size effects removal.

    The highest correlation (0.3349) is between Socializing and Interior. The second highest (0.1870) pairs Ice Coffee with Take-Away, and the third (0.1704) pairs Take-Away with different Tastes.

    Analysing the correlations, we can argue that the most correlated factors are similar or logically connected: if a person links coffee with socializing, he will choose a café with a nice interior in which to spend time with another person.

    Furthermore, ice coffee is a long drink that takes time to finish, so people can take it away and drink it slowly.

    The negative correlations also let us draw some important conclusions. First, the strongest negative correlation (-0.2312) shows that people do not link Smoking with Take-Away. Likewise, the negative correlation of -0.2269 shows that people do not relate Ice Coffee to Socializing.

    Eigenvalues of the Correlation Matrix

         Eigenvalue  Difference  Proportion  Cumulative
    1    1.73942362  0.19850007      0.1933      0.1933
    2    1.54092355  0.26656185      0.1712      0.3645
    3    1.27436170  0.23053503      0.1416      0.5061
    4    1.04382667  0.11817241      0.1160      0.6221
    5    0.92565426  0.15262272      0.1029      0.7249
    6    0.77303154  0.09324188      0.0859      0.8108
    7    0.67978966  0.09613972      0.0755      0.8863
    8    0.58364994  0.14431088      0.0648      0.9512
    9    0.43933906                  0.0488      1.0000

    The new correlation matrix has 4 principal components with eigenvalues greater than one, which we want to consider. However, they explain only 62% of the information, which would lead us to consider at least 5 components.

    After some consideration we decided to keep all 9 principal components for our analysis, because our sample is not homogeneous and specific: it is composed of people from three different countries and, consequently, diverse cultures.

    We need all the principal components because we require specific and detailed data. We want a macro view of the customer, as the scope of this research is to make general macro-marketing and strategic decisions about opening cafés in different countries.

    Eigenvectors

              Prin1     Prin2     Prin3     Prin4     Prin5     Prin6     Prin7     Prin8     Prin9
    n_1    0.371810  0.427576  -.021365  -.273788  -.143199  0.477643  0.150053  -.541602  0.197539
    n_2    0.394858  0.444974  0.100454  -.181705  -.054699  -.283115  -.471510  0.454429  0.301668
    n_3    -.324870  -.052409  0.533502  -.428624  0.124845  -.367410  0.230515  -.208916  0.419537
    n_4    0.389770  -.104202  0.143501  0.644117  -.002649  -.407235  -.032695  -.437329  0.208731
    n_5    -.165954  0.338640  0.207353  0.394213  0.673584  0.353772  0.084423  0.177793  0.204390
    n_6    -.429974  0.152762  -.489183  0.222761  -.358620  -.021224  0.084132  0.051592  0.603378
    n_7    0.469938  -.245918  -.263986  -.124679  0.197221  -.081206  0.649893  0.339844  0.223837
    n_8    0.096979  -.607624  -.094785  -.145799  0.226310  0.331441  -.494236  -.086094  0.422368
    n_9    0.075611  -.194617  0.568567  0.227133  -.537096  0.385772  0.141847  0.328755  0.126714

    Principal component analysis is a method for reducing the dimensionality of data. When two or more variables explain the same phenomenon, the purpose is to find a summary of the information they share. Where we have correlated variables, therefore, we introduce principal components.

    The statistical area Rp, where p is the number of variables, will be reduced in an area Rx with x

    As we have seen, we have 9 theoretical PCs. As the order of extraction increases, the variance of the components decreases, which indicates that they lose importance (significance). This happens because PCs are built so that the first has maximum variance and each following one has less. However, each PC brings new informative content. We decided to use all the principal components for our analysis because each has a weight of about 5% or more and, as already mentioned, our sample is not homogeneous, so all the collected information is useful for our research.

    6 The Cluster Analysis

    The next step is to divide the sample into clusters, also called segments. The purpose of this analysis is to find groups of people whose characteristics are sufficiently similar with respect to one or more variables (in this case the segmentation is carried out according to the scale questions of our questionnaire).

    These groups should show very small variance within them (homogeneity of the cluster) and, at the same time, significant variance between them (the clusters must be as diverse as possible so that we can give them a meaning).

    In this way we should create homogeneous groups of potential customers that are easy to find and study, for example for marketing purposes.

    We have to point out that the number of clusters is influenced by the purpose of the research. For general marketing and strategic choices the number of clusters should be low: we are trying to find the best concepts for cafes, and too many groups would be too varied to implement. Conversely, for micro marketing and operational decisions the number of clusters should be very high to be effective.

    The first step of the cluster analysis is the creation of a distance matrix in which the observations are both the rows and the columns, while each cell holds the measure of similarity or distance for a pair of observations. The distance matrix thus relates the observations to one another in an NxN matrix, giving the relative distance, and hence the similarity, between each pair.

    Example of distance matrix

          a    b    c    d    e
    a     0   16    1    9   10
    b    16    0   17   25    2
    c     1   17    0    4    9
    d     9   25    4    0   13
    e    10    2    9   13    0

    There are various ways of measuring the distance between observations, the distance between clusters and their similarity. These measures are then used as selection criteria for deciding whether or not to merge different clusters: the distance measures how different the observations are. In hierarchical clustering the distance matrix is progressively reduced by merging the observations with the shortest distance (minimum distance fusion).

    Under this method, in our example, customers a and c are the most similar (distance equal to 1), so they merge into a new unit f. The distance from this new unit to each other unit is the minimum of the distances from the two old units.

    Following the example we have:

          f    b    d    e
    f     0   16    4    9
    b    16    0   25    2
    d     4   25    0   13
    e     9    2   13    0

    Now the most similar customers are b and e, so they merge into a new unit g, and so on until only a single cluster remains. This idea leads to Ward's method.
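    The merging steps of the example can be traced with a few lines of code. The Python sketch below is an illustration of minimum distance fusion on the five-customer example matrix only; the helper names `closest_pair` and `merge` are ours, and it is not the procedure SAS runs for Ward's method.

```python
labels = "abcde"
rows = [
    [0, 16, 1, 9, 10],
    [16, 0, 17, 25, 2],
    [1, 17, 0, 4, 9],
    [9, 25, 4, 0, 13],
    [10, 2, 9, 13, 0],
]
# Nested dictionary: distances["a"]["c"] is the distance between a and c.
distances = {x: dict(zip(labels, row)) for x, row in zip(labels, rows)}


def closest_pair(d):
    """Return (distance, x, y) for the two most similar units."""
    names = sorted(d)
    return min(
        (d[x][y], x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
    )


def merge(d, x, y, new):
    """Merge units x and y into `new`, keeping the minimum distance to each other unit."""
    others = [k for k in d if k not in (x, y)]
    merged = {k: {m: d[k][m] for m in others} for k in others}
    merged[new] = {new: 0}
    for k in others:
        shortest = min(d[x][k], d[y][k])
        merged[new][k] = shortest
        merged[k][new] = shortest
    return merged


print(closest_pair(distances))    # a and c are merged first
step1 = merge(distances, "a", "c", "f")
print(closest_pair(step1))        # then b and e
```

    After the first merge the distance between f and d is 4, the smaller of the distances a-d (9) and c-d (4), in both the f row and the d row of the reduced matrix.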

    7 Ward's Method

    In this analysis we use Ward's method, which creates groups by merging observations when the distance between the two is minimal. This distance is calculated as the sum of squared Euclidean distances, which in turn are computed using the Pythagorean theorem.

    $$\text{Sum of squared errors} = SSE = \sum_i \sum_j \sum_k (X_{i,jk} - Y_{i,jk})^2$$

    This method therefore tends to maximize the variance between the different clusters and to minimize the variance within each cluster:

    $$\mathrm{Var}(X) = \mathrm{Var}_{within} + \mathrm{Var}_{between}$$

    $$\mathrm{Var}_{within} = \frac{\sum_i (X_{ip} - \bar{X}_p)^2}{N_p}$$

    that is, the sum over the units i of group p of the squared difference between the value of the unit and the average value of the group, divided by the size of the group.

    $$\mathrm{Var}_{between} = \frac{\sum_p (\bar{X}_p - \bar{X})^2 N_p}{N_{tot}}$$

    that is, the sum over the groups (clusters) of the squared difference between the average value of the group and the total average value, multiplied by the size of the group, divided by the size of the whole population.

    Below is the procedure we used in SAS to create the dataset Coffee.tree.

    proc cluster data=Coffee.cluster method=ward outtree=Coffee.tree;
      var Prin1-Prin9;
      id id;
    run;

    We then used this dataset to draw the dendrogram, which helped us find the number of significant clusters for our analysis.
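    The split of the total variance into a within part and a between part can be checked numerically. The Python sketch below is an illustration on hypothetical one-dimensional data, not part of the original analysis. It assumes that the within-cluster variances are combined as an average weighted by cluster size, which is the form for which the identity Var(X) = Var within + Var between holds exactly.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def decompose(clusters):
    """Return (total, within, between) variance for a list of clusters."""
    everything = [v for cluster in clusters for v in cluster]
    n_total = len(everything)
    grand_mean = sum(everything) / n_total
    # Within: per-cluster variance, weighted by the size of each cluster.
    within = sum(len(c) * variance(c) for c in clusters) / n_total
    # Between: squared distance of each cluster mean from the grand mean,
    # weighted by the size of each cluster.
    between = sum(
        len(c) * (sum(c) / len(c) - grand_mean) ** 2 for c in clusters
    ) / n_total
    return variance(everything), within, between


# Two hypothetical, well separated clusters.
total, within, between = decompose([[1, 2, 3], [7, 8, 9]])
print(total, within, between)
```

    For well separated clusters almost all of the variance is between the groups, which is the situation a good partition aims for.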

    7.1 Dendrogram - Graphical Representation

    In a clustering procedure, the dendrogram provides a graphical representation of the process of grouping observations: it shows the relative distance at which the statistical units are merged. The graph is drawn in a Cartesian plane, with the X axis showing the logical distance of the clusters according to the chosen measure and the Y axis showing the hierarchical level of aggregation, or fusion distance.

    The choice of hierarchical level defines the number of clusters adopted for the analysis. The observations are aggregated according to their degree of proximity: the further apart observations are on the X axis, the lower the possibility that they are aggregated into the same cluster.

    However, this possibility also depends on the level of fusion distance accepted: the higher the level of the hierarchy chosen, the lower the number of clusters found.

    In our research, we use the dendrogram to identify the relevant number of clusters to examine.

    It is now possible to create the dendrogram in SAS. After trying a few different numbers of clusters, we settled on 3. The result of the procedure is the graph below, which was cut so as to identify these 3 clusters. The cut was made at r = 0.06, indicating a fairly good level of accuracy. The dendrogram could be cut lower, giving a higher number of clusters, but for our strategic decisions, as mentioned before, the lower number serves better.

    The SAS procedure used:

    proc tree data=Coffee.tree nclusters=3 out=Coffee.parti3;
      id id;
    run;


    [Figure: SAS Dendrogram. 3 Clusters. Axes: Semi-Partial R-Squared (0.000 to 0.100) and respondent id.]


    8 T-TEST Procedure

    8.1 Preparation of the dataset

    Having identified the clusters, we now want to understand the distinctive characteristics of each group found.

    To do this we created a generic cluster in which we included all the answers received. This allows us to compare the mean scores of each cluster with the overall average results and to understand which clusters differ from the overall results, and how.

    8.2 T-Test for Respondents

    To study the relationship between the quantitative variables and the segments that we obtained using PROC CLUSTER we can use the t-test.

    This test allows us to calculate a value t that is associated with the probability that, for the variable being tested, the average calculated for the cluster differs from the average calculated on the entire sample only by chance. It follows that the smaller this probability, indicated in SAS as Pr > |t|, the lower the probability that the difference between the means is caused by random effects, and hence the more likely it is that the variable is significant in explaining that cluster.

    The significance level usually taken as the target value in the literature is 0.05. The null hypothesis is therefore rejected if the probability is less than 0.05, at the 95% confidence level.

    The t-value was calculated as follows: $t = (\bar{x}_c - \bar{x}_{tot}) / s_c$, where $s_c$ is the standard error, an estimate of the variability of the estimator of the mean.

    Some words should be said about the variance: since the variance of the population is of course not available, an estimator of the variance must be used. SAS calculates the t-test using two methods, which differ only in how they treat the variance that is then used in the standard error of the t-test.

    The Satterthwaite method calculates the standard error from the two variances (of the cluster and of the population), each weighted by its own sample size. This method does not assume equality of variances and can be applied in all circumstances. The Pooled method differs from the previous one in that it obtains the standard error from a single pooled variance, and so requires equal variances: as a result it can only be applied in specific circumstances, namely when the F-test for equality of variances does not reject the null hypothesis. Considering that in the event of equality of variances the two methods produce the same value of t, it seems more efficient to use the Satterthwaite method.
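    As an illustration of the Satterthwaite approach, the following Python sketch (not the SAS procedure, and with made-up scores) computes the t-value with separate variances together with the approximate degrees of freedom. The fractional DF values in the SAS output, such as 131.32, come from this kind of approximation. The function name is hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def satterthwaite_t(cluster, overall):
    """t-value and approximate degrees of freedom with unequal variances."""
    n1, n2 = len(cluster), len(overall)
    q1 = variance(cluster) / n1   # squared standard error of the first mean
    q2 = variance(overall) / n2   # squared standard error of the second mean
    t = (fmean(cluster) - fmean(overall)) / sqrt(q1 + q2)
    df = (q1 + q2) ** 2 / (q1 ** 2 / (n1 - 1) + q2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical scores: one cluster against a comparison group.
t, df = satterthwaite_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```

    A negative t means the cluster mean is below the mean of the comparison group, which is how the signs in the tables of this chapter are read.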


    Before further calculations, the data must be sorted and merged in order to create the general cluster 4, so that the data can be compared in the t-test:

    proc sort data=Coffee.parti3;
    by id;
    run;

    data Coffee.compare;

    merge Coffee.cluster

    Coffee.parti3;

    by id;

    run;

    data Coffee.compare1;

    set Coffee.compare;

    cluster=4;

    run;

    data Coffee.compare3;

    set Coffee.compare Coffee.compare1;

    run;
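    The effect of these data steps can be sketched in Python (an illustration only, with three made-up respondents and a hypothetical function name): every respondent is kept once with the original cluster and once more with cluster 4. This is why the frequency tables in the next chapter have a grand total of 314 for 157 respondents, with row 4 equal to the sum of rows 1 to 3.

```python
def add_overall_cluster(rows, overall_label=4):
    """Append a copy of every row relabelled as the overall cluster."""
    copies = [{**row, "cluster": overall_label} for row in rows]
    return rows + copies


# Hypothetical respondents with one survey score each.
respondents = [
    {"id": 1, "cluster": 1, "n_1": 5},
    {"id": 2, "cluster": 2, "n_1": 3},
    {"id": 3, "cluster": 3, "n_1": 4},
]
stacked = add_overall_cluster(respondents)
```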

    It is now possible to check, for each cluster, which variables describe its specific features, together with the direction and strength of this relationship, if any.


    8.3 Cluster 1

    proc ttest data=Coffee.compare3;

    var n_1-n_9;

    class cluster;

    where cluster=1 or cluster=4;

    run;

    8.3.1 Features:

    Variable  Method         Variances  DF       t Value  Pr > |t|  Feature
    1         Satterthwaite  Unequal    131.32      3.24    0.0015  interior
    2         Satterthwaite  Unequal    108.6       1.60    0.1124  socializing
    3         Satterthwaite  Unequal    90.227     -2.45    0.0161  caffeine
    4         Satterthwaite  Unequal    97.436     -0.96    0.3406
    5         Satterthwaite  Unequal    108.28     -0.71    0.4784
    6         Satterthwaite  Unequal    123.69      5.53


    This leads us to conclude that, when opening a café, respondents in Cluster 1 place a high value on the possibility to smoke, the interior of the café and socializing. On the other hand, when deciding where to drink coffee it is not important to them whether iced coffee is offered, whether the coffee is strong or not, or whether take-away is possible.

    8.4 Cluster 2

    proc ttest data=Coffee.compare3;

    var n_1-n_9;

    class cluster;

    where cluster=2 or cluster=4;

    run;

    8.4.1 Features:

    Variable  Method         Variances  DF       t Value  Pr > |t|  Feature
    1         Satterthwaite  Unequal    69.053     -3.96    0.0002  interior
    2         Satterthwaite  Unequal    85.885     -5.26


    8.5 Cluster 3

    proc ttest data=Coffee.compare3;

    var n_1-n_9;

    class cluster;

    where cluster=3 or cluster=4;

    run;

    8.5.1 Features:

    Variable  Method         Variances  DF       t Value  Pr > |t|  Feature
    1         Satterthwaite  Unequal    108.85      1.19    0.2367
    2         Satterthwaite  Unequal    107.46      2.96    0.0037  socializing
    3         Satterthwaite  Unequal    104.54      1.02    0.3121
    4         Satterthwaite  Unequal    94.269      2.99    0.0036  tastes
    5         Satterthwaite  Unequal    89.599      1.77    0.0802  price

    6 Satterthwaite Unequal 199.78 -10.11 |t|).

    Analyzing the t-values, we find that, except for the variable Smoking, all the other variables had a positive influence on the choices of the members of this group, which means that they care about these features when drinking coffee, while Smoking is not important here.

    We can see from the results that the variables Tastes, Socializing and Take-Away have the biggest effect, with Dessert slightly lower. Price was the least influential of these five variables. It is clear that we are dealing with people who like to have coffee in different tastes, take it with dessert and associate it with socializing with other people.


    9 Proc Freq Procedure (Chi Square Test)

    After determining the three clusters and their particular characteristics, it is important to compare each cluster with the qualitative characteristics in order to learn more about it. Moreover, we can compare two qualitative variables with each other to better understand our sample. For this calculation we will use the PROC FREQ procedure and the chi-square (CHISQ) test.

    CHISQ provides chi-square tests of independence for each stratum and computes measures of association. The chi-square test is used when you have one variable or group (cluster) and compare it with a variable that takes two or more values (sex, country, age, etc.). The observed counts of observations in each category are compared with the expected counts, which are calculated from a theoretical expectation.

    The null hypothesis is that the variables (cluster and country, age, etc.) are independent of each other; the alternative hypothesis is that the variables are not independent, that is, they are associated with each other. For each frequency, the statistical null hypothesis is that the number of observations in each category is equal to that predicted, and the alternative hypothesis is that the observed numbers differ from the expected ones. The test lets us retain or reject the main hypothesis that the clusters and the chosen variable are independent and, furthermore, compare the frequencies of each group in each cluster.

    The test statistic is calculated by taking an observed number (O), subtracting the expected

    number (E), and then squaring this difference. The larger the deviation from the null

    hypothesis, the larger the difference between observed and expected is. Squaring the

    differences makes them all positive. Each difference is divided by the expected number, and

    these standardized differences are summed.

    The shape of the chi-square distribution depends on the number of degrees of freedom. For an extrinsic null hypothesis, the number of degrees of freedom is simply the number of values of the variable, minus one. In a test where there is more than one nominal variable, the number of degrees of freedom is equal to (number of rows − 1) × (number of columns − 1); in our case of a 4×3 table, there are (4−1)(3−1) = 6 degrees of freedom.
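    The calculation of the test statistic and of the degrees of freedom can be sketched as follows. This is a Python illustration rather than SAS, the 2×2 table is made up, and the function names are hypothetical. The expected count of a cell is taken as row total × column total / grand total, which is what independence implies.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square(observed):
    """Sum over all cells of (O - E)^2 / E."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


def degrees_of_freedom(n_rows, n_cols):
    return (n_rows - 1) * (n_cols - 1)


# Hypothetical 2x2 table of counts.
toy = [[10, 20], [20, 10]]
stat = chi_square(toy)
```

    For a 4×3 table such as cluster by country, `degrees_of_freedom(4, 3)` gives the 6 degrees of freedom mentioned above.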

    In practice, the main hypotheses for evaluating each variable against cluster are:

    H0: the variables are independent;

    H1: the variables are not independent.


    We say that the variables are independent, and we retain hypothesis H0, when the chi-square probability p > 0.05 (we use the 95% confidence level by default); otherwise we reject H0 and accept H1.

    9.1 Cluster x Variable

    In this part of the analysis we will compare each cluster with all the qualitative characteristics, first those from the generic questions and then questions m_1-m_4. The only variable we will not compare with cluster is occupation, as it does not give much information, given that the majority of respondents are students.

    9.1.1 Cluster x Country

    The biggest difference among the people who participated in the survey is their country, as they have different cultures and habits. First we will compare each cluster with the country where the respondents live.

    The SAS program we use is:

    proc freq data=Coffee.compare3;
    table cluster*country / all expected;
    run;

    Frequency
    Expected
    Percent
    Row Pct
    Col Pct       Italy  Lithuania  Palestine      Total

    1                15         24         18         57
                 18.516     22.873     15.611
                   4.78       7.64       5.73      18.15
                  26.32      42.11      31.58
                  14.71      19.05      20.93

    2                23         13         10         46
                 14.943     18.459     12.599
                   7.32       4.14       3.18      14.65
                  50.00      28.26      21.74
                  22.55      10.32      11.63

    3                13         26         15         54
                 17.541     21.669      14.79
                   4.14       8.28       4.78      17.20
                  24.07      48.15      27.78
                  12.75      20.63      17.44

    4                51         63         43        157
                     51         63         43
                  16.24      20.06      13.69      50.00
                  32.48      40.13      27.39
                  50.00      50.00      50.00

    Total           102        126         86        314
                  32.48      40.13      27.39     100.00


    Statistics for Table of CLUSTER by country

    Statistic DF Value Prob

    Chi-Square 6 9.6280 0.1412

    Likelihood Ratio Chi-Square 6 9.3295 0.1559

    Mantel-Haenszel Chi-Square 1 0.0052 0.9428

    Phi Coefficient 0.1751

    Contingency Coefficient 0.1725

    Cramer's V 0.1238

    We see that the chi-square probability is 0.1412 > 0.05 (chi-square value 9.628, with 6 degrees of freedom). We retain the null hypothesis H0 and state that cluster and the variable country are independent.

    On the other hand, the probability is 14%, which is not very high, and the chi-square value is above 9, which is not extremely low, so there might be some relationship. Analyzing the cells one by one, we can find some differences in each cluster between the expected value and the actual frequency:

    Cluster 2 has a frequency of 23 Italians instead of the expected 15, which is 50% instead of 32%; this means that the second cluster has features more common to Italians. Moreover, the same cluster has a somewhat lower than expected frequency of Lithuanians, 28% instead of 40%, so the characteristics of this cluster are less common among Lithuanians. Palestinians show no notable difference between expected and actual frequency.

    Cluster 3 does not differ much from the expected values. We find a slightly lower frequency of Italians than expected, 24% instead of 32%, and slightly more Lithuanians than expected, 48% instead of 40%.

    All in all, Cluster 1 is common to all three countries. Cluster 2 is more suitable for Italians, and Palestinians do not reject it either. Cluster 3 reflects Lithuanian habits more; Palestinians do not reject it, while Italian preferences differ slightly here.
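    The cell-by-cell reading above can be backed up by looking at how much each cell contributes to the chi-square value. The sketch below is in Python (not SAS) and uses the observed frequencies of the cluster by country table; the variable names are illustrative. Cramer's V is computed with the usual formula, the square root of chi-square / (n × min(rows − 1, columns − 1)).

```python
from math import sqrt

# Observed frequencies from the cluster x country table
# (columns: Italy, Lithuania, Palestine).
observed = [
    [15, 24, 18],   # cluster 1
    [23, 13, 10],   # cluster 2
    [13, 26, 15],   # cluster 3
    [51, 63, 43],   # cluster 4 (all respondents)
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of each cell: (O - E)^2 / E with E = row total * column total / n.
contributions = [
    [(o - rt * ct / n) ** 2 / (rt * ct / n) for o, ct in zip(row, col_totals)]
    for row, rt in zip(observed, row_totals)
]
chi2 = sum(sum(row) for row in contributions)
cramers_v = sqrt(chi2 / (n * min(len(observed) - 1, len(col_totals) - 1)))

# Cell with the largest contribution, as (value, row index, column index).
largest = max((value, i, j)
              for i, row in enumerate(contributions)
              for j, value in enumerate(row))
```

    The largest contribution comes from the Italians in Cluster 2, which matches the reading of the frequencies given above.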


    9.1.2 Cluster x Gender

    The second characteristic by which we will compare the clusters is gender (sex). Here we use the SAS program:

    proc freq data=Coffee.compare3;
    table cluster*sex / all expected;
    run;

    Frequency
    Expected
    Percent
    Row Pct
    Col Pct      Female       Male      Total

    1                32         25         57
                 32.312     24.688
                  10.19       7.96      18.15
                  56.14      43.86
                  17.98      18.38

    2                25         21         46
                 26.076     19.924
                   7.96       6.69      14.65
                  54.35      45.65
                  14.04      15.44

    3                32         22         54
                 30.611     23.389
                  10.19       7.01      17.20
                  59.26      40.74
                  17.98      16.18

    4                89         68        157
                     89         68
                  28.34      21.66      50.00
                  56.69      43.31
                  50.00      50.00

    Total           178        136        314
                  56.69      43.31     100.00

    Statistics for Table of CLUSTER by sex

    Statistic DF Value Prob

    Chi-Square 3 0.2550 0.9683

    Likelihood Ratio Chi-Square 3 0.2553 0.9682

    Mantel-Haenszel Chi-Square 1 0.0272 0.8689

    Phi Coefficient 0.0285

    Contingency Coefficient 0.0285

    Cramer's V 0.0285

    These results indicate that there is no statistically significant relationship between cluster and gender (chi-square with 3 degrees of freedom = 0.2550, p = 0.9683). The associated probability is 97% and the chi-square value is very low; we clearly see that there is no association between cluster and gender, and the features of all three clusters are acceptable for both genders.


    9.1.3 Cluster x Age

    Another variable to check is age, using the program:

    proc freq data=Coffee.compare3;
    table cluster*age / all expected;
    run;

    Frequency
    Expected
    Percent
    Row Pct
    Col Pct     14-19    20-24    25-29    30-39     =>40    Total

    1               1       26       21        5        4       57
               1.0892   31.223   18.153   3.9936   2.5414
                 0.32     8.28     6.69     1.59     1.27    18.15
                 1.75    45.61    36.84     8.77     7.02
                16.67    15.12    21.00    22.73    28.57

    2               1       30       10        3        2       46
                0.879   25.197    14.65   3.2229    2.051
                 0.32     9.55     3.18     0.96     0.64    14.65
                 2.17    65.22    21.74     6.52     4.35
                16.67    17.44    10.00    13.64    14.29

    3               1       30       19        3        1       54
               1.0318    29.58   17.197   3.7834   2.4076
                 0.32     9.55     6.05     0.96     0.32    17.20
                 1.85    55.56    35.19     5.56     1.85
                16.67    17.44    19.00    13.64     7.14

    4               3       86       50       11        7      157
                    3       86       50       11        7
                 0.96    27.39    15.92     3.50     2.23    50.00
                 1.91    54.78    31.85     7.01     4.46
                50.00    50.00    50.00    50.00    50.00

    Total           6      172      100       22       14      314
                 1.91    54.78    31.85     7.01     4.46   100.00

    Statistics for Table of CLUSTER by age

    Statistic DF Value Prob

    Chi-Square 12 6.0238 0.9149

    Likelihood Ratio Chi-Square 12 6.2856 0.9010

    Mantel-Haenszel Chi-Square 1 0.5908 0.4421

    Phi Coefficient 0.1385

    Contingency Coefficient 0.1372

    Cramer's V 0.0800

    WARNING: 50% of the cells have expected counts less

    than 5. Chi-Square may not be a valid test.


    These results show that there is no statistically significant relationship between cluster and age (chi-square with 12 degrees of freedom = 6.0238, p = 0.9149).

    On the other hand, SAS warns that 50% of the cells have expected counts less than 5, so the chi-square may not be a valid test. For further calculations Fisher's exact test should be used.

    Fisher's exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of less than five. The chi-square test assumes that each cell has an expected frequency of five or more, whereas Fisher's exact test has no such assumption and can be used regardless of how small the expected frequencies are.

    We could use the following program:

    proc freq data=Coffee.compare3;
    tables cluster*age / fisher;
    run;
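    To show the idea behind the exact test, here is a minimal Python sketch restricted to a 2×2 table. This is a simplification: the cluster by age table has more rows and columns, which the FISHER option in SAS handles but this sketch does not. The counts in the example are made up and the function name is hypothetical.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals kept fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Add up all tables that are at most as likely as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical table: 3 and 1 in the first row, 1 and 3 in the second.
p = fisher_exact_2x2(3, 1, 1, 3)
```

    Because the probabilities are computed exactly from the counts, no minimum expected frequency is needed.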

    However, we clearly see that the majority of our respondents are 20-29 years old (87%), so we focus on young people overall and further calculations are not necessary.


    9.1.4 Cluster x Smoker

    When opening a cafe it is important to know how the respondents relate to smoking, in order to decide whether to prepare places for smokers or not to invest in them. First we will determine the frequencies of smokers in each cluster.

    proc freq data=Coffee.compare3;
    table cluster*smoker / all expected;
    run;

    Frequency

    Expected

    Percent

    Row Pct

    Col Pct No Yes Total

    1 19 38 57

    33.764 23.236

    6.05 12.10 18.15

    33.33 66.67

    10.22 29.69

    2 25 21 46

    27.248 18.752

    7.96 6.69 14.65

    54.35 45.65

    13.44 16.41

    3 49 5 54

    31.987 22.013

    15.61 1.59 17.20

    90.74 9.26

    26.34 3.91

    4 93 64 157

    93 64

    29.62 20.38 50.00

    59.24 40.76

    50.00 50.00

    Total 186 128 314

    59.24 40.76 100.00

    Statistics for Table of CLUSTER by smoker

    Statistic DF Value Prob

    Chi-Square 3 38.4895


    This time the results show that there is a statistically significant relationship between cluster and smoking habits (chi-square with 3 degrees of freedom = 38.49, p =


    9.1.5 Cluster x The Time of the Day

    In this part we will check whether there is any link between the time of day when coffee is drunk (m_1) and cluster. In this way, for example, the opening hours of the cafe could be optimized.

    We use the program:

    proc freq data=Coffee.compare3;
    table cluster*m_1 / all expected;
    run;

    Frequency

    Expected

    Percent

    Row Pct

    Col Pct

    Columns: After meals, Afternoon, Does not matter the time, Evening, Morning, Usually when I meet other people for this purpose, Total

    1 7 1 23 0 18 8 57

    6.172 2.9045 20.694 0.3631 20.331 6.535

    2.23 0.32 7.32 0.00 5.73 2.55 18.15

    12.28 1.75 40.35 0.00 31.58 14.04

    20.59 6.25 20.18 0.00 16.07 22.22

    2 7 2 14 1 19 3 46

    4.9809 2.3439 16.701 0.293 16.408 5.2739

    2.23 0.64 4.46 0.32 6.05 0.96 14.65

    15.22 4.35 30.43 2.17 41.30 6.52

    20.59 12.50 12.28 50.00 16.96 8.33

    3 3 5 20 0 19 7 54

    5.8471 2.7516 19.605 0.3439 19.261 6.1911

    0.96 1.59 6.37 0.00 6.05 2.23 17.20

    5.56 9.26 37.04 0.00 35.19 12.96

    8.82 31.25 17.54 0.00 16.96 19.44

    4 17 8 57 1 56 18 157

    17 8 57 1 56 18

    5.41 2.55 18.15 0.32 17.83 5.73 50.00

    10.83 5.10 36.31 0.64 35.67 11.46

    50.00 50.00 50.00 50.00 50.00 50.00

    Total 34 16 114 2 112 36 314

    10.83 5.10 36.31 0.64 35.67 11.46 100.00

    Statistics for Table of CLUSTER by m_1

    Statistic DF Value Prob

    Chi-Square 15 10.6619 0.7762

    Likelihood Ratio Chi-Square 15 11.1431 0.7424

    Mantel-Haenszel Chi-Square 1 0.0290 0.8648

    Phi Coefficient 0.1843

    Contingency Coefficient 0.1812

    Cramer's V 0.1064

    WARNING: 33% of the cells have expected counts less

    than 5. Chi-Square may not be a valid test.


    These results show that there is no statistically significant relationship between cluster and coffee drinking time (chi-square with 15 degrees of freedom = 10.66, p = 0.7762). On the other hand, SAS warns that 33% of the cells have expected counts less than 5, so the chi-square may not be a valid test. For further calculations Fisher's exact test should be used.

    We also notice that most respondents drink coffee in the morning (36%), say that the time is not important (36%), or drink it when they meet other people (11%). Just one respondent marked evening, so overall we can say that respondents are used to drinking coffee at all times except the evening. All the clusters show a similar trend, so further calculations are not necessary for our conclusion.

    9.1.6 Cluster x The Type of Coffee

    Another variable to analyze is the type of coffee (m_2) preferred by each cluster.

    proc freq data=Coffee.compare3;
    table cluster*m_2 / all expected;
    run;

    Frequency

    Expected

    Percent

    Row Pct

    Col Pct

    Columns: Americano, Caffe Latte, Cappuccino, Espresso, Other (not traditional types), Total

    1 1 13 8 32 3 57

    2.5414 13.796 11.981 25.777 2.9045

    0.32 4.14 2.55 10.19 0.96 18.15

    1.75 22.81 14.04 56.14 5.26

    7.14 17.11 12.12 22.54 18.75

    2 2 6 10 25 3 46

    2.051 11.134 9.6688 20.803 2.3439

    0.64 1.91 3.18 7.96 0.96 14.65

    4.35 13.04 21.74 54.35 6.52

    14.29 7.89 15.15 17.61 18.75

    3 4 19 15 14 2 54

    2.4076 13.07 11.35 24.42 2.7516

    1.27 6.05 4.78 4.46 0.64 17.20

    7.41 35.19 27.78 25.93 3.70

    28.57 25.00 22.73 9.86 12.50

    4 7 38 33 71 8 157

    7 38 33 71 8

    2.23 12.10 10.51 22.61 2.55 50.00

    4.46 24.20 21.02 45.22 5.10

    50.00 50.00 50.00 50.00 50.00

    Total 14 76 66 142 16 314

    4.46 24.20 21.02 45.22 5.10 100.00


    Statistics for Table of CLUSTER by m_2

    Statistic DF Value Prob

    Chi-Square 12 16.7882 0.1577

    Likelihood Ratio Chi-Square 12 17.7738 0.1227

    Mantel-Haenszel Chi-Square 1 2.2114 0.1370

    Phi Coefficient 0.2312

    Contingency Coefficient 0.2253

    Cramer's V 0.1335

    WARNING: 30% of the cells have expected counts less

    than 5. Chi-Square may not be a valid test.

    The results show that there is no statistically significant relationship between cluster and the type of coffee preferred (chi-square with 12 degrees of freedom = 16.79, p = 0.1577). Again, SAS warns that 30% of the cells have expected counts less than 5, so the chi-square may not be a valid test. For further calculations Fisher's exact test should be used.

    On the other hand, we see that Espresso is the most popular type of coffee (45%); respondents also choose Caffe Latte (24%) and Cappuccino (21%), while the other choices are less significant.

    Cluster 1, analyzing just the frequencies, is fonder of Espresso, at 56% instead of the expected 45%. Its members also like Caffe Latte (23%), but choose Cappuccino less often (14% instead of 21%).

    Cluster 2 respondents are also Espresso drinkers: 54% against the expected 45%. Their second choice is Cappuccino (22%), but Caffe Latte is not so popular here (13% instead of 24%).

    Cluster 3 respondents are definitely Caffe Latte drinkers (35% instead of 24%). Their second choice is Cappuccino (28% instead of 21%) and their third is Espresso, whose frequency is considerably lower than expected (26% instead of 45%).
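    The comparisons of the form "56% instead of the expected 45%" set the row percentage of a cluster against the percentage in the whole sample. The Python sketch below (not SAS) reproduces them from the observed frequencies of clusters 1 to 3 in the table above; cluster 4 is left out because it is only the copy of all respondents. The function name is hypothetical.

```python
def row_vs_overall(counts, columns):
    """For each row, pair its row percentage with the overall percentage."""
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(col_totals)
    overall = [100 * ct / n for ct in col_totals]
    profiles = []
    for row in counts:
        total = sum(row)
        profiles.append({
            name: (round(100 * value / total, 2), round(share, 2))
            for name, value, share in zip(columns, row, overall)
        })
    return profiles


columns = ["Americano", "Caffe Latte", "Cappuccino", "Espresso", "Other"]
counts = [
    [1, 13, 8, 32, 3],    # cluster 1
    [2, 6, 10, 25, 3],    # cluster 2
    [4, 19, 15, 14, 2],   # cluster 3
]
profiles = row_vs_overall(counts, columns)
espresso_cluster_1 = profiles[0]["Espresso"]     # (row pct, overall pct)
latte_cluster_3 = profiles[2]["Caffe Latte"]
```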


    9.1.7 Cluster x Times Per Day

    In this part we will compare each cluster with the number of times per day (m_3) respondents drink coffee.

    proc freq data=Coffee.compare3;
    table cluster*m_3 / all expected;
    run;

    Frequency

    Expected

    Percent

    Row Pct

    Col Pct

    Columns: 1 or less, 2, 3, >3, Total

    1 13 20 11 13 57

    17.427 21.783 8.7134 9.0764

    4.14 6.37 3.50 4.14 18.15

    22.81 35.09 19.30 22.81

    13.54 16.67 22.92 26.00

    2 12 22 6 6 46

    14.064 17.58 7.0318 7.3248

    3.82 7.01 1.91 1.91 14.65

    26.09 47.83 13.04 13.04

    12.50 18.33 12.50 12.00

    3 23 18 7 6 54

    16.51 20.637 8.2548 8.5987

    7.32 5.73 2.23 1.91 17.20

    42.59 33.33 12.96 11.11

    23.96 15.00 14.58 12.00

    4 48 60 24 25 157

    48 60 24 25

    15.29 19.11 7.64 7.96 50.00

    30.57 38.22 15.29 15.92

    50.00 50.00 50.00 50.00

    Total 96 120 48 50 314

    30.57 38.22 15.29 15.92 100.00

    Statistics for Table of CLUSTER by m_3

    Statistic DF Value Prob

    Chi-Square 9 9.2367 0.4157

    Likelihood Ratio Chi-Square 9 8.8972 0.4468

    Mantel-Haenszel Chi-Square 1 1.6380 0.2006

    Phi Coefficient 0.1715

    Contingency Coefficient 0.1690

    Cramer's V 0.0990


    These results support the null hypothesis H0 that there is no statistically significant relationship between cluster and the number of times a day coffee is drunk (chi-square with 9 degrees of freedom = 9.2367, p = 0.4157). The associated probability is 41%. Analyzing the clusters and the answers one by one, there are no very significant differences. The only thing worth noting is that in Cluster 3 respondents choose drinking coffee once a day or less more often than expected: 43% instead of the expected 31%. Knowing the characteristics of the clusters, we can say that these respondents probably associate coffee with socializing and dessert rather than with an everyday routine.

    Looking at the frequencies, respondents in Cluster 1 and Cluster 2 most often choose coffee twice a day. In Cluster 3 the twice-a-day choice is also significant.

    9.1.8 Cluster x Way of Drinking Coffee

    This time we will compare cluster and the way of drinking coffee (m_4), using the program:

    proc freq data=Coffee.compare3;
    table cluster*m_4 / all expected;
    run;

    Frequency

    Expected

    Percent

    Row Pct

    Col Pct

    Columns: Bar, Take Away, Sitting in Cafe, Total

    1 11 11 35 57

    13.433 11.618 31.949

    3.50 3.50 11.15 18.15

    19.30 19.30 61.40

    14.86 17.19 19.89

    2 18 9 19 46

    10.841 9.3758 25.783

    5.73 2.87 6.05 14.65

    39.13 19.57 41.30

    24.32 14.06 10.80

    3 8 12 34 54

    12.726 11.006 30.268

    2.55 3.82 10.83 17.20

    14.81 22.22 62.96

    10.81 18.75 19.32

    4 37 32 88 157

    37 32 88

    11.78 10.19 28.03 50.00

    23.57 20.38 56.05

    50.00 50.00 50.00

    Total 74 64 176 314

    23.57 20.38 56.05 100.00


    Statistics for Table of CLUSTER by m_4

    Statistic DF Value Prob

    Chi-Square 6 9.5977 0.1426

    Likelihood Ratio Chi-Square 6 9.2570 0.1596

    Mantel-Haenszel Chi-Square 1 0.0296 0.8633

    Phi Coefficient 0.1748

    Contingency Coefficient 0.1722

    Cramer's V 0.1236

    The results support the null hypothesis H0 that there is no statistically significant relationship between cluster and the way of drinking coffee (chi-square with 6 degrees of freedom = 9.5977, p = 0.1426). The associated probability is not very high, at 14%.

    Analyzing the frequencies one by one, we notice some values that differ from the expected ones in Cluster 2, where respondents choose more often than usual to drink coffee quickly at the bar (39% instead of the expected 24%), and less often than usual to take a cup of coffee without hurry in a café (41% instead of 56%).

    Overall, looking at the whole sample, sitting in a café and taking one's time is the most popular way to drink coffee (56%), while take-away (20%) and a fast coffee at the bar (24%) have more or less the same popularity.


    9.2 Country X Variable

    In this part we will compare our variable Country with the other qualitative variables related to personal coffee drinking habits (questions m_1-m_4). As mentioned before, the variable country is important to analyze for marketing decisions because of cultural differences.

    9.2.1 Country x The Time of the Day

    First we compare the variable country with the time of day when coffee is drunk (m_1), using the program:

    proc freq data=Coffee.compare3;
    table country*m_1 / all expected;
    run;

    Frequency

    Expected

    Percent

    Row Pct

    Col Pct

    Columns: After meals, Afternoon, Does not matter the time, Evening, Morning, Usually when I meet other people for this purpose, Total

    Italy 20 4 34 2 42 0 102

    11.045 5.1975 37.032 0.6497 36.382 11.694

    6.37 1.27 10.83 0.64 13.38 0.00 32.48

    19.61 3.92 33.33 1.96 41.18 0.00

    58.82 25.00 29.82 100.00 37.50 0.00

    Lithuanian 10 12 48 0 40 16 126

    13.643 6.4204 45.745 0.8025 44.943 14.446

    3.18 3.82 15.29 0.00 12.74 5.10 40.13

    7.94 9.52 38.10 0.00 31.75 12.70

    29.41 75.00 42.11 0.00 35.71 44.44

    Palestine 4 0 32 0 30 20 86

    9.3121 4.3822 31.223 0.5478 30.675 9.8599

    1.27 0.00 10.19 0.00 9.55 6.37 27.39

    4.65 0.00 37.21 0.00 34.88 23.26

    11.76 0.00 28.07 0.00 26.79 55.56

    Total 34 16 114 2 112 36 314

    10.83 5.10 36.31 0.64 35.67 11.46 100.00


    Statistics for Table of country by m_1

    Statistic DF Value Prob

    Chi-Square 10 49.0229


    9.2.2 Country x Type of Coffee

    Another variable to compare by country is the type of coffee (m_2).

    proc freq data=Coffee.compare3;
    table country*m_2 / all expected;
    run;

    Frequency

    Expected

    Percent

    Row Pct

    Col Pct

    Columns: Americano, Caffe Latte, Cappuccino, Espresso, Other (not traditional types), Total

    Italy 6 6 8 82 0 102

    4.5478 24.688 21.439 46.127 5.1975

    1.91 1.91 2.55 26.11 0.00 32.48

    5.88 5.88 7.84 80.39 0.00

    42.86 7.89 12.12 57.75 0.00

    Lithuanian 6 64 22 30 4 126

    5.6178 30.497 26.484 56.981 6.4204

    1.91 20.38 7.01 9.55 1.27 40.13

    4.76 50.79 17.46 23.81 3.17

    42.86 84.21 33.33 21.13 25.00

    Palestine 2 6 36 30 12 86

    3.8344 20.815 18.076 38.892 4.3822

    0.64 1.91 11.46 9.55 3.82 27.39

    2.33 6.98 41.86 34.88 13.95

    14.29 7.89 54.55 21.13 75.00

    Total 14 76 66 142 16 314

    4.46 24.20 21.02 45.22 5.10 100.00

    Statistics for Table of country by m_2

    Statistic DF Value Prob

    Chi-Square 8 151.8787


    There is definitely a statistically significant relationship between country and type of coffee (chi-square with 8 degrees of freedom = 151.88, p =


    Palestine 30 24 18 14 86

    26.293 32.866 13.146 13.694

    9.55 7.64 5.73 4.46 27.39

    34.88 27.91 20.93 16.28

    31.25 20.00 37.50 28.00

    Total 96 120 48 50 314

    30.57 38.22 15.29 15.92 100.00

    Statistics for Table of country by m_3

    Statistic DF Value Prob

    Chi-Square 6 29.5616


    9.2.4 Country X The Way of Drinking

    Finally we compare country with the way of drinking coffee (m_4).

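    /* Cross-tabulate country by way of drinking coffee (m_4). ALL requests the
       chi-square and association statistics; EXPECTED prints expected counts. */
    proc freq data=Coffee.compare3;
        table country*m_4 / all expected;
    run;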

    country     Statistic       Bar  Take Away  Sitting in Cafe   Total

    Italy       Frequency        66         12               24     102
                Expected     24.038      20.79           57.172
                Percent       21.02       3.82             7.64   32.48
                Row Pct       64.71      11.76            23.53
                Col Pct       89.19      18.75            13.64

    Lithuanian  Frequency         4         36               86     126
                Expected     29.694     25.682           70.624
                Percent        1.27      11.46            27.39   40.13
                Row Pct        3.17      28.57            68.25
                Col Pct        5.41      56.25            48.86

    Palestine   Frequency         4         16               66      86
                Expected     20.268     17.529           48.204
                Percent        1.27       5.10            21.02   27.39
                Row Pct        4.65      18.60            76.74
                Col Pct        5.41      25.00            37.50

    Total       Frequency        74         64              176     314
                Percent       23.57      20.38            56.05  100.00

    Statistics for Table of country by m_4

    Statistic DF Value Prob

    Chi-Square 4 145.6996
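The Prob column of this output is the upper-tail probability of the chi-square distribution at the reported statistic. As a hedged illustration outside the original SAS workflow, the Python sketch below uses a closed form that holds only for even degrees of freedom (as here, with 4); the helper name `chi_square_p_value_even_df` is hypothetical.

```python
import math


def chi_square_p_value_even_df(statistic, df):
    """Upper-tail probability P(X > statistic) for a chi-square variable.

    Valid for even df only: with df = 2k the tail equals
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = statistic / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Statistic and degrees of freedom as reported for country by m_4.
p_m4 = chi_square_p_value_even_df(145.6996, 4)
print(p_m4 < 0.0001)
```

Under these assumptions the probability for the country by m_4 table is far below 0.0001, so the association between country and way of drinking is statistically significant at any conventional level.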


    of 56%); moreover, take-away is less common in this culture (12% instead of the expected 20%).

    Lithuanians, by contrast, take their time over a cup of coffee more often than expected (68% instead of 56%). Otherwise, to save time, they take their coffee away (29%). They are not in the habit of drinking coffee in a hurry at the bar (3% instead of the expected 24%).

    Palestinians are more similar to Lithuanians than to Italians. They prefer taking their time over a cup of coffee (77% instead of the expected 56%), and 19% of them choose take-away coffee. Only a few (5% instead of the expected 24%) drink a quick coffee at the bar.
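The comparisons above (for example "68% instead of 56%") set each country's row percentage against the share of the whole sample that chose the same way of drinking. A short Python sketch of that arithmetic follows; it is an illustration, not the authors' SAS code, and the variable names are assumptions.

```python
# Observed counts from the country-by-m_4 table in this section.
# Column order: Bar, Take Away, Sitting in Cafe.
ways = ["Bar", "Take Away", "Sitting in Cafe"]
counts = {
    "Italy": [66, 12, 24],
    "Lithuanian": [4, 36, 86],
    "Palestine": [4, 16, 66],
}

grand_total = sum(sum(row) for row in counts.values())
col_totals = [sum(row[j] for row in counts.values()) for j in range(len(ways))]

# Share of the whole sample choosing each way: the "expected" percentage.
expected_pct = [100 * c / grand_total for c in col_totals]

# Share within each country: the "Row Pct" line of the SAS output.
row_pct = {
    country: [100 * n / sum(row) for n in row]
    for country, row in counts.items()
}

for country, pcts in row_pct.items():
    for way, observed, expected in zip(ways, pcts, expected_pct):
        print(f"{country:<11}{way:<16}{observed:6.2f}% vs {expected:6.2f}% expected")
```

A row percentage well above the expected percentage (Lithuanians sitting in a café) marks a habit that is over-represented in that country; one well below it (Lithuanians at the bar) marks a habit that is under-represented.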


    9.2.5 The Most Significant Features of The Countries

    After comparing the three countries on different variables, we can recognize some clear features of, and differences between, Italy, Lithuania and Palestine.

    9.2.5.1 Italy

    Italians have specific coffee drinking traditions. They usually drink coffee twice a day: in the morning and after meals, or at any other time of the day. Italians prefer Espresso and are more likely to drink it quickly, at the bar. These are the most significant features of the Italian respondents.

    9.2.5.2 Lithuania

    Lithuanians drink coffee once or twice a day, usually in the morning and then at any other time of the day, often for the purpose of socializing. Although Espresso is the most popular coffee overall, Lithuanians are real Caffè Latte lovers. They also enjoy sitting in the café and taking their time.

    9.2.5.3 Palestine

    Palestinians are more similar to Lithuanians than to Italians. They usually drink coffee in the morning, and otherwise regardless of the time of day, for the purpose of meeting people and socializing. Palestinians drink coffee either rarely (less than once a day) or three or more times a day. They have a strong preference for Cappuccino; as everywhere, Espresso is also important. More than the other groups, Palestinians like different, non-traditional kinds of coffee. Like Lithuanians, they have a strong preference for taking their time over coffee in a café.


    10 Strategic Decisions

    Finally we come to the conclusion, where we determine the different groups with their particular characteristics and habits, which tell us what kind of café would be popular with each group. First we determine the most significant features of each country. Then we add these features when making the strategic marketing decision on what kind of café to open in each area.

    10.1 Cluster 1- Sophisticated Coffee and Cigarettes

    As we saw above, cluster 1 is a group of people who enjoy smoking, a good interior and atmosphere, and socializing while they drink coffee. Respondents with these characteristics are spread across all three countries, without any significant differences from the expected frequencies. The distinguishing feature of this cluster is that most of its respondents are smokers. Moreover, as is common across the whole sample, people in this cluster mostly choose to take their time over coffee.

    For this group of respondents, fashionable cafés should be opened. Most attention should be paid to the interior and to a cozy atmosphere that invites customers to stay longer. Even if smoking inside is forbidden, a smoking area should be available, with heaters for winter (especially in Lithuania). The cafés should be situated in locations that are convenient for meetings. Since Italians tend to drink coffee quickly at the bar, the café in Italy should have fewer seats and a more attractive bar, for a quick coffee on the way or with a cigarette outside. In the other countries a bar is not necessary; more attention must be paid to providing enough places to sit. The menu in a sophisticated place should include traditional types of coffee.


    10.2 Cluster 2- Fast Coffee

    Cluster 2 describes people who prefer good-quality strong coffee, mostly Espresso, and also iced coffee (we assume in the summer season). These people do not like to spend a lot of time on the process; they prefer fast coffee. Moreover, Cluster 2 has a higher frequency of Italians than expected.

    The café that satisfies the needs of the group described by Cluster 2 should be a simple coffee bar in a convenient location: next to offices, city-center shopping areas, universities and lunch restaurants. Good-quality strong coffee and a choice of iced coffee in summer are essential features. No investment should be made in extended menus with non-traditional kinds of coffee.

    Given the coffee drinking habits in Italy and the fact that this cluster has a significant number of Italians, these coffee bars should be opened in Italy first. Because there is strong competition from places with a similar concept in this country, we would compete on good-quality coffee, convenient locations and fast service, or the possibility of self-service, to make the process less time-consuming.


    10.3 Cluster 3- Sweet Break or Take-Away

    Cluster 3 describes people who like meeting other people and, while socializing, having a cup of coffee in different flavors, with a dessert on the side. Their preferred type of coffee is Caffè Latte, and probably its variations. These respondents also choose take-away coffee. This way of combining coffee with socializing and dessert is more common among Lithuanians.

    Here a Starbucks-style coffee place should be opened. The café would offer an extended menu of different flavors. Most attention should be paid to Caffè Latte with different syrups. An attractive dessert menu should be available. The place should be cozy enough to spend some time in, but simple, keeping prices low.

    Given the features of the three countries and the fact that this cluster is more common among Lithuanians, we would first open cafés of this concept in Lithuania. Palestine should not be forgotten either, as Palestinians express a preference for different types of coffee, drink it many times a day and take their time over it. For the moment we should not invest in opening such a place in Italy, as people there have somewhat different habits.


    11 Appendix

    Appendix 1. Questionnaire.

    Coffee drinking habits in Palestine, Lithuania and Italy

    We are conducting research on café preferences and coffee drinking habits in three different cultures, with the purpose of finding the best café concept for each location. The questionnaire therefore concerns your coffee drinking habits in cafés, not at home (if, for example, you drink coffee every morning at home and later in a café with other people, please relate your answers to the latter). Please keep that in mind when answering the questions. If you do not drink coffee, do not fill in the questionnaire. We kindly ask you to fill in the questionnaire only if you are originally from Lithuania, Italy or Palestine. For each multiple choice question, choose the one best answer. For scale type questions, evaluate the statement or answer the question on a scale where 1 is the most negative answer and 10 is the most positive. The survey is completely anonymous.

    1. Choose one favourite type of coffee, which you usually drink, from the list below: *
        Espresso
        Americano
        Cappuccino
        Caffee Latte
        Other, not traditional types

    2. When do you usually drink coffee? *
        Morning
        Afternoon
        Evening
        After meals
        Does not matter the time
        Usually when I meet other people for this purpose

    3. How many times a day do you usually drink coffee? *
        1 or less
        2
        3


        Working person
        Unemployed
        Other

    18. Are you a smoker? *
        No
        Yes